This paper explores issues involved in implementing robot
learning for a challenging dynamic task, using a case study
from robot juggling. We use a memory-based local modeling
approach (locally weighted regression) to represent a
learned model of the task to be performed. Statistical tests
are given to examine the uncertainty of a model, to optimize
its prediction quality, and to deal with noisy and corrupted
data. We develop an exploration algorithm that explicitly
deals with prediction accuracy requirements during exploration.
Using all these ingredients in combination with
methods from optimal control, our robot achieves fast real-time
learning of the task within 40 to 100 trials.
Introduction

Learning  control  means  improving  a  motor  skill  by  repeatedly  practicing  a  task.  There  has 
been much progress in learning control research. But many projects test proposed algorithms
only in simulation. We have found that actual implementation of learning control
forces us to consider issues not adequately addressed in simulations. In this paper we describe
which ingredients were needed to actually implement a learning algorithm on a
robot  for  a  complicated  dynamic  task. 

We  are  exploring  systems  that  learn  by  explicitly  remembering  their  experiences  in 
order to build models of the world. The learning community distinguishes between two different
methods to represent a model: parametric and nonparametric. A parametric model
consists  of  a  certain  mathematical  function  which  possesses  a  finite  set  of  free  parameters 
that  have  to  be  determined  to  make  the  function  fit  the  data.  This  function  models  all  data 
simultaneously,  which  means  that  parametric  models  correspond  to  global  function  fitting. 
Parametric  models  and  training  methods  often  do  not  remember  the  data  they  were  trained 
on.  Standard  linear  regression,  sigmoidal  neural  networks,  radial  basis  function  networks, 
etc.,  belong  in  this  class  of  techniques.  Nonparametric  models  also  have  an  underlying 
function with a set of parameters which are to be adjusted. However, the number of the parameters
can grow with the amount of data and the parameters can be recalculated whenever
the model is used to generate an output from a new input (a process which is also
called  a  lookup  or  query).  This  makes  sense  if  not  all  data  is  taken  into  account  to  estimate 
the  parameters  but  merely  a  subset,  or  if  individual  data  points  are  weighted  differently 
with  respect  to  different  query  points.  Common  algorithms  to  choose  the  subset  or  the 
weighting  are,  for  example,  n-nearest  neighbor  methods  (e.g.,  [12, 25])  or  kernel  regression 
(e.g.,  [19,  28,  37]).  Common  functions  are  (hyper-)planes  or  (hyper-)quadratic  surfaces.  By 
letting only a few data points contribute to forming the parameters, these types of nonparametric
models correspond to local function fitting: they build a local model to fit a
subset  of  data  points  with  their  function.  As  the  word  “local”  implies,  the  model  will  be 
valid only in a restricted region. Due to the necessity of continuous recalculation of the parameters
for each individual query, local nonparametric models have to memorize all data
and  are  often  called  memory-based.  Weighted  averaging  and  nearest  neighbor  methods  are 
presumably  the  best  known  nonparametric  approaches. 

We  are  investigating  a  recently  developed  nonparametric  (memory-based)  statistical 
technique, locally weighted regression (LWR), to model the system we are trying to control
[11, 15, 16]. The LWR approach allows us to efficiently estimate local linear models
for  different  points  in  the  state  space.  LWR  offers  a  variety  of  statistical  tools  to  assess  the 
reliability  of  lookups,  to  optimize  the  quality  of  a  lookup,  and  also  to  cope  with  noise  and 
corrupted data. This allows the robot to monitor its own skill level, and it provides the basis
for an exploratory behavior that is almost entirely driven by the stream of incoming data
from  practicing  the  task. 

Our  starting  point  for  modeling  is  that  we  assume  knowledge  of  what  constitutes  a 
state  of  the  system,  i.e.,  the  input/output  representations,  but  the  form  of  the  dynamics 
equations of the task to be controlled is unknown. Past work tested our ideas by implementing
learning for one-shot or static tasks, such as throwing a ball at a target [1], and
also repetitive or dynamic tasks, such as bouncing a ball on a paddle [2] and hitting a stick
back and forth (a form of juggling known as devil sticking) [40]. This as well as other experimental
work (e.g., [32]) has highlighted the importance of making sure the control
paradigm used is robust to uncertainty, that the robot is able to compute what is known
about the task, and how well it is known, and that there is some process that generates exploration,
so that models and controllers based on insufficient data are improved. All these
points are addressed by the LWR learning algorithm. Using our work with the devil sticking
robot as an example, this paper describes what was needed to implement real-time
learning  based  on  this  algorithm. 

The  next  section  of  this  paper  discusses  a  number  of  control  approaches  which  make 
use of models and motivates the choice in our work. Locally weighted regression and some
of its statistical tools are introduced afterwards. Exploration, a key feature for system identification
of modeling approaches, receives attention in the fourth section where we introduce
a goal-directed exploration algorithm which keeps explicit control over prediction accuracy
during exploration. In the fifth section, the previously introduced methods find
application in a real-time implementation of learning how to juggle the devil stick.

Control  Paradigms 

Before  discussing  the  details  of  our  representational  approach,  it  is  useful  to  consider  some 
of  the  alternative  control  paradigms  that  might  make  use  of  learned  models. 

Deadbeat  Control 

Our notation has the following conventions: scalars are denoted by lower case letters in italic face (e.g., $s$), vectors
are denoted by lower case letters in bold face (e.g., $\mathbf{v}$), matrices are denoted by upper case letters in bold face (e.g.,
$\mathbf{M}$), scalar valued functions are in italic face (e.g., $f()$), vector valued functions are in bold face (e.g., $\mathbf{f}()$), and $()^T$
denotes the transpose of a vector or matrix, whereby all vectors are originally column vectors. The ^ (caret) denotes
models and predictions by models. Dots on top of variables indicate time derivatives.

In considering repetitive or dynamic tasks, we will focus on nonlinear regulator design, assuming
there is a desired state $\mathbf{x}_d$ to achieve. Since often the observations of system inputs
and outputs occur at discrete time intervals and not all the derivatives of the state are
typically measured, we restrict our analysis to discrete time models. The notation for the
forward dynamics model of a discrete system is

$$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k, \mathbf{u}_k) \tag{2.1}$$

and attempts to perform the task generate experience vectors $(\mathbf{x}_k^T, \mathbf{u}_k^T, \mathbf{x}_{k+1}^T)^T$. A straightforward
approach to improving performance on the task is to learn an inverse model

$$\mathbf{u}_k = \mathbf{f}^{-1}(\mathbf{x}_k, \mathbf{x}_{k+1}) \tag{2.2}$$

from the database of experiences and use the model to predict commands for later attempts
of the task by replacing $\mathbf{x}_{k+1}$ in (2.2) by a desired state $\mathbf{x}_{k+1} = \mathbf{x}_{desired}$. Another approach is to
learn the forward model (2.1) and then search for a good command, minimizing the (by Q
and R weighted) squared magnitude of the predicted state error and the command:

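$$\mathbf{u}_k = \arg\min_{\mathbf{u}} \left[ (\hat{\mathbf{f}}(\mathbf{x}_k, \mathbf{u}) - \mathbf{x}_d)^T \mathbf{Q} (\hat{\mathbf{f}}(\mathbf{x}_k, \mathbf{u}) - \mathbf{x}_d) + \mathbf{u}^T \mathbf{R} \mathbf{u} \right] \tag{2.3}$$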

Eq.(2.2)  and  Eq.(2.3)  with  R  =  0  correspond  to  deadbeat  control. 

The  deadbeat  controllers  above  did  not  achieve  satisfying  robustness  in  our  work 
since they attempt to cancel the plant dynamics entirely. A less aggressive nonlinear control
approach is to locally linearize the system about the desired point, and then use one of
the many linear controller design techniques, e.g., pole placement, linear quadratic (LQ), or
$H_\infty$. Such an approach is very successful if the system remains within the linear region.

Representing  the  Forward  Model 

Modeling approaches require model representations. If the nonlinear system has a particular
structure, it can be globally linearized using nonlinear coordinate transformations and
state feedback (feedback linearization) [36]. Any linear control design techniques may be
used subsequently. Much of the recent work in adaptive controllers for nonlinear systems
assumes some knowledge of the form of the nonlinearities and the plant's unknown parameters
[30]. A common formulation requires the plant be representable accurately by a
feedback linearizable model in which all unknown elements appear linearly as a parameter
vector. A more black box approach to adaptive control [21] is to use a form of parametric
Volterra series in the inputs and states. Single hidden layer perceptron-like neural network
models  essentially  project  the  input  data  along  a  line  given  by  the  input  weights,  and  then 
output  a  one  dimensional  function  of  the  value  of  that  projection.  Radial  basis  function 
networks  use  centers  of  spherically  symmetric  contributions  from  parameterized  one 
dimensional  functions  applied  to  the  distance  between  each  input  and  the  center.  All  these 
approaches  make  implicit  assumptions  about  the  form  of  the  system  they  are  interacting 
with,  which  we  want  to  avoid,  as  will  be  demonstrated  in  the  next  section. 

Optimal  Control  Approaches 

Learning approaches that do not commit to a particular representational form generate numerical
representations, for which optimal control techniques provide natural methods to
design control systems for nonlinear tasks. Dynamic programming [8, 9, 13] lays the basis
for a general paradigm of nonlinear controllers. In our formulation of the regulation problem,
a goal state $\mathbf{x}_d$ is given, which is typically an equilibrium state, so $\mathbf{x}_d = \mathbf{f}(\mathbf{x}_d, \mathbf{0})$. A
one  step  cost  L(x,u)  is  defined  over  all  states  and  controls.  The  criterion  to  be  optimized  is 
the infinite horizon sum of one step costs starting at the current time:

$$J = \sum_{k=1}^{\infty} L(\mathbf{x}_k, \mathbf{u}_k). \tag{2.4}$$

We typically require either a temporal discount factor or $L(\mathbf{x}_d, \mathbf{0}) = 0$ to ensure well defined
solutions to the optimization problem. The value function $V(\mathbf{x})$ is the optimal cost
created  by  solving  (2.4)  starting  in  state  x.  At  any  point,  a  globally  optimal  control  action 
can  be  chosen  by  the  nonlinear  controller  by  solving  the  local  optimization  problem: 


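$$\mathbf{u}^{opt} = \arg\min_{\mathbf{u}} \left[ L(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u})) \right]. \tag{2.5}$$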

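If one assumes a locally linear model of the plant,

$$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k, \mathbf{u}_k) \approx \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k + \mathbf{c}, \tag{2.6}$$

a weighted locally quadratic model of the one step cost,

$$L(\mathbf{x}, \mathbf{u}) \approx \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \tfrac{1}{2}\mathbf{u}^T\mathbf{R}\mathbf{u} + \mathbf{x}^T\mathbf{S}\mathbf{u} + \mathbf{t}^T\mathbf{u}, \tag{2.7}$$

and a locally quadratic model of the value function,

$$V(\mathbf{x}) \approx V_0 + \mathbf{V}_x\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{V}_{xx}\mathbf{x}, \tag{2.8}$$

one can compute a locally optimal command analytically:

$$\mathbf{u}^{opt} = -(\mathbf{R} + \mathbf{B}^T\mathbf{V}_{xx}\mathbf{B})^{-1}(\mathbf{B}^T\mathbf{V}_{xx}\mathbf{A}\mathbf{x} + \mathbf{S}^T\mathbf{x} + \mathbf{B}^T\mathbf{V}_{xx}\mathbf{c} + \mathbf{B}^T\mathbf{V}_x^T + \mathbf{t}). \tag{2.9}$$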


Unfortunately,  value  functions  are  difficult  to  represent  and  to  compute,  even  though 
this  can  be  done  off-line.  Predictive  control  design  techniques  avoid  using  a  value  function, 
but  are  then  merely  locally  optimal  [10].  Value  functions  can  also  be  approximated,  e.g., 
with neural networks [42]. We are interested in exploring approximations to value functions
that produce a locally quadratic model of $V(\mathbf{x})$ in a local neighborhood of $\mathbf{x}$.
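
As an illustration of the locally optimal command (2.9), the following Python sketch evaluates the formula with NumPy. The function name `local_lq_command`, the argument layout, and the choice to pass the value function gradient $\mathbf{V}_x$ as a 1-D array are assumptions made for this example; they are not taken from the implementation described in this paper.

```python
import numpy as np

def local_lq_command(x, A, B, c, R, S, t, V_x, V_xx):
    """Minimize L(x,u) + V(Ax + Bu + c) over u for the local models (2.6)-(2.8).

    V_x is the gradient of the local value function model, given as a 1-D array.
    """
    # Curvature of the objective with respect to the command u.
    H = R + B.T @ V_xx @ B
    # Gradient of the objective with respect to u, evaluated at u = 0.
    g = B.T @ V_xx @ (A @ x + c) + S.T @ x + B.T @ V_x + t
    # Setting the gradient H u + g to zero gives the command of (2.9).
    return -np.linalg.solve(H, g)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is a common numerical choice when the command dimension is small.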

In  this  paper  we  are  working  within  an  optimal  control  framework.  We  would  like  to 
design  a  fully  nonlinear  controller  from  a  full  computation  of  the  optimal  value  function. 
This is currently too expensive to compute, so we use linear quadratic (LQ) regulator techniques
to approximate the value function and design a corresponding controller. We make
extensive  use  of  local  linear  models  of  the  system  to  be  controlled.  The  linearized  models 
are  calculated  on  an  as  needed  basis  and  are  recalculated  with  each  new  piece  of  data  to 
update  the  controller.  All  of  this  happens  in  real  time  as  the  robot  is  executing  the  task. 
Locally Weighted Regression

The  point  of  view  explored  in  this  paper  is  that  the  goal  of  a  learning  system  for  robots  is 
to  be  able  to  build  internal  models  of  tasks  during  execution  of  those  tasks.  These  models 
are multidimensional functions that are approximated from sampled data (the previous experiences
or attempts to perform the task). The learned models are used in a variety of
ways  to  successfully  execute  the  task.  We  would  like  the  models  to  incorporate  the  latest 
information.  The  models  will  be  continuously  updated  with  a  stream  of  new  training  data, 
so  updating  a  model  with  new  data  should  take  a  short  period  of  time.  There  are  also  time 
constraints  on  how  long  it  can  take  to  use  a  model  to  make  a  prediction.  Because  we  are 
interested  in  control  methods  that  make  use  of  local  linearizations  of  the  plant  model,  we 
want  a  representation  that  can  quickly  compute  a  local  linear  model  of  the  represented 
transformation.  We  would  also  like  to  minimize  the  negative  interference  from  learning 
new  knowledge  on  previously  stored  information. 


Figure 1: Characteristic performance of three different nonparametric function approximation techniques:
(a) nearest neighbor; (b) weighted average; (c) locally weighted regression

As the most generic approximator that satisfies many of these criteria, we explore a
version of memory-based learning techniques called locally weighted regression (LWR)
[15, 16, 11, 6, 24, 14, 27]. A memory-based learning (MBL) system is trained by storing
the training data in a memory. This allows MBL systems to achieve real-time learning.
MBL avoids interference between new and old data by retaining and using all the data to
answer each query. MBL approximates complex functions using simple local models, as
does a Taylor series. Examples of types of local models include nearest neighbor, weighted
average, and locally weighted regression. Each of these local models combines points near
to  a  query  point  to  estimate  the  appropriate  output.  Figure  1  shows  typical  curve  fits  for 
each  of  these  methods. 

Locally  weighted  regression  uses  a  relatively  complex  regression  procedure  to  form 
the  local  model,  and  is  thus  more  expensive  than  nearest  neighbor  and  weighted  average 
memory-based learning procedures. For each query a new local model is formed. The rate
at  which  local  models  can  be  formed  and  evaluated  limits  the  rate  at  which  queries  can  be 
answered.  This  paper  describes  how  locally  weighted  regression  can  be  implemented  in 
real  time. 


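An unweighted regression finds the solution to the equations:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} \tag{3.1a}$$

by solving the normal equations:

$$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}, \tag{3.1b}$$

where $\mathbf{X}$ is an $m \times (n+1)$ matrix consisting of m data points, each represented by its n input
dimensions and a "1" in the last column, $\mathbf{y}$ is a vector of corresponding outputs for
each data point, and $\boldsymbol{\beta}$ is the $n+1$ vector of unknown regression parameters. The solution
minimizes J, the sum of squared errors over all given data points (cf. Table 1, Appendix A).
Solving for $\boldsymbol{\beta}$ yields

$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}, \tag{3.2}$$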

and  a  prediction  of  the  outcome  of  a  query  point  becomes: 

$$\hat{y}_q = \mathbf{x}_q^T\boldsymbol{\beta}. \tag{3.3}$$

However, this gives distant points equal influence with nearby points on the ultimate answer
to the query, for equally spaced data. To weight similar points more, locally weighted
regression is used. First, a distance is calculated from each of the stored data points (rows
in the $\mathbf{X}$ matrix) to the query point $\mathbf{x}_q$:


$$d_i^2 = \sum_{j=1}^{n} \left( s_j (X_{ij} - x_{q,j}) \right)^2 \tag{3.4}$$


The factor $s_j$ reflects a positive weighting (distance metric) among the n input dimensions,
either  to  normalize  those  or  to  give  them  different  importance.  The  weight  for  each  stored 
data  point  is  a  function  of  the  distance  (3.4): 


$$w_i = f(d_i^2). \tag{3.5}$$

Each row i of $\mathbf{X}$ and $\mathbf{y}$ is multiplied by the corresponding weight $w_i$. A simple weighting
function  just  raises  the  distance  (3.4)  to  a  negative  power,  which  determines  how  local  the 
regression  will  be  (the  rate  of  drop-off  of  the  weights  with  distance): 


$$w_i = \frac{1}{d_i^{\,p}}. \tag{3.6}$$


This  type  of  weighting  function  goes  to  infinity  as  the  query  point  approaches  a  stored  data 
point  which  forces  the  locally  weighted  regression  to  exactly  match  that  stored  point.  If  the 
data is noisy, exact interpolation is not desirable, and a weighting scheme with limited
magnitude is more appropriate. One such scheme, which we use in what follows, is a
Gaussian kernel:

$$w_i = \exp\left(-\frac{d_i^2}{2k^2}\right). \tag{3.7}$$


The parameter k scales the size of the kernel to determine how local the regression will be.
Such a weighting is used in Figure 1b and Figure 1c.

A  potential  problem  is  that  the  data  points  may  be  distributed  in  such  a  way  as  to 
make  the  regression  matrix  X  singular.  Ridge  regression  is  used  to  prevent  problems  due 
to a singular data matrix. The following equation, with $\mathbf{X}$ and $\mathbf{y}$ already weighted, is
solved for $\boldsymbol{\beta}$:

$$(\mathbf{X}^T\mathbf{X} + \boldsymbol{\Lambda})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}, \tag{3.8}$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix with small positive diagonal elements $\lambda_i^2$. Ridge regression
is equivalent to adding fake data in each direction that has a small weight and a zero output
value. The ridge regression constants can also be thought of as Bayesian priors on the variance
of the estimated parameter vector $\boldsymbol{\beta}$.
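
The lookup described by the distance, the Gaussian kernel (3.7), and the ridge-regularized normal equations (3.8) can be sketched in a few lines of Python. This is a minimal illustration for a single output dimension, not the real-time implementation discussed in this paper. The function name `lwr_predict`, the default values of `k` and `ridge`, the use of one common ridge constant for all parameters, and the placement of the scale factors $s_j$ inside the squared difference are assumptions of this example.

```python
import numpy as np

def lwr_predict(X, y, x_q, k=0.3, s=None, ridge=1e-6):
    """Locally weighted linear regression lookup at the query point x_q.

    X: (m, n) stored inputs, y: (m,) stored outputs, x_q: (n,) query.
    Returns the prediction and the local parameter vector beta (n + 1 entries).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_q = np.asarray(x_q, dtype=float)
    m, n = X.shape
    s = np.ones(n) if s is None else np.asarray(s, dtype=float)

    # Squared scaled distance of every stored point to the query (assumed form).
    d2 = np.sum((s * (X - x_q)) ** 2, axis=1)
    # Gaussian kernel weights, cf. (3.7).
    w = np.exp(-d2 / (2.0 * k ** 2))

    # Append the constant "1" column, then weight each row of X and y.
    Xa = np.hstack([X, np.ones((m, 1))])
    Xw = Xa * w[:, None]
    yw = y * w

    # Ridge-regularized normal equations, cf. (3.8).
    Lam = ridge * np.eye(n + 1)
    beta = np.linalg.solve(Xw.T @ Xw + Lam, Xw.T @ yw)

    # Prediction at the query point, with the constant term appended.
    return np.append(x_q, 1.0) @ beta, beta
```

The returned `beta` contains the local slope entries followed by the offset, so the same lookup also yields the local linear model that the control methods of the previous section rely on.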

Assessing  the  computational  cost 

A lookup in an LWR model has three stages: forming weights, forming the regression matrix,
and solving the normal equations. Let us examine how the cost of each of these stages
grows  with  the  size  of  the  data  set  and  dimensionality  of  the  problem.  We  will  assume  a 
linear  local  model. 


Forming  and  applying  the  weights  involves  scanning  the  entire  data  set,  so  it  scales 
linearly  with  the  number  of  data  points  in  the  database  m .  For  each  of  n  input  dimensions 
there  are  a  constant  number  of  operations,  so  the  number  of  operations  scales  linearly  with 
the number of input dimensions. Note that we can eliminate points whose distance exceeds
a  threshold,  reducing  the  number  of  points  considered  in  subsequent  computational  stages. 

Each element of $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$ is the inner (dot) product of two columns of $\mathbf{X}$ or $\mathbf{y}$.
The architecture of digital signal processors is ideally suited for this computation, which
consists of repeated multiplies and accumulates. The computation is linear in the number
of rows m and quadratic in the number of columns $(n^2 + n \cdot o)$, where o is the number of
output  dimensions. 

Solving the normal equations is done using an $\mathbf{L}\mathbf{D}\mathbf{L}^T$ decomposition, which is cubic in
the number of input dimensions, and independent of the number of data points. Other more
sophisticated and more expensive decompositions, such as the singular value decomposition,
are unnecessary since the ridge regression procedure guarantees well-conditioned
normal  equations. 
The most straightforward parallel implementation of LWR would distribute the data
points among several processors. Queries can be broadcast to the processors, and each processor
can weight its data set and form its contribution to $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$. These contributions
can be summed and the full normal equations solved on a single processor. The
communication costs are linear in the number of processors, quadratic in the number of
columns $(n^2 + n \cdot o)$, and independent of the total number of points.
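
The reason this distribution works is that $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$ are sums over data points, so partial sums formed on separate chunks add up to the full matrices. The following Python sketch demonstrates this on a single machine; the helper names `partial_normal_equations` and `combine_and_solve` are hypothetical, and the chunks stand in for the per-processor data sets.

```python
import numpy as np

def partial_normal_equations(X_chunk, y_chunk, w_chunk):
    """Contribution of one chunk of (already augmented) data to X^T X and X^T y."""
    Xw = X_chunk * w_chunk[:, None]   # weight the rows of this chunk
    yw = y_chunk * w_chunk            # weight the matching outputs
    return Xw.T @ Xw, Xw.T @ yw

def combine_and_solve(parts, ridge=1e-6):
    """Sum the per-chunk contributions and solve the ridge normal equations once."""
    XtX = sum(p[0] for p in parts)
    Xty = sum(p[1] for p in parts)
    return np.linalg.solve(XtX + ridge * np.eye(XtX.shape[0]), Xty)
```

Each contribution has a fixed size that depends only on the number of columns, which is why the amount of data exchanged does not grow with the number of stored points.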

We have implemented the locally weighted regression procedure on a 33 MHz Intel
i860 microprocessor. The peak computation rate of this processor is 66 MFlops. We have
achieved effective computation rates of 20 MFlops on a learning problem with n = 10 input
dimensions and o = 5 output dimensions, using a linear local model. This leads to a
lookup  time  of  approximately  15  milliseconds  on  a  database  of  m  =  1000  points. 

Tuning  The  Fit  Parameters 

In the past we have used off-line global cross validation [41] to estimate reasonable values
for the fit parameters: the distance metric $s_j$, the parameters that define the weighting function
$w_i = f(d_i^2)$, and the ridge regression parameters $\lambda_j$. Since we are using a local model
that is linear in the unknown parameters, we can compute derivatives of the cross validation
error $e_i = \hat{y}_i - y_i$ with respect to the fit parameters:

$$\frac{\partial e_i}{\partial s_j}, \quad \frac{\partial e_i}{\partial k}, \quad \frac{\partial e_i}{\partial \lambda_j},$$

and minimize the sum of the squared cross validation error using a Levenberg-Marquardt
(nonlinear least squares) procedure (MINPACK, NL2SOL).

However,  it  is  clear  that  these  parameters  should  depend  on  the  location  of  the  query 
point. In this section we describe new procedures that estimate local values of the fit parameters
optimized for the site of the current query point. We want to demonstrate the differences
between local and global fitting in an example where we only focus on the kernel
width k of a Gaussian weighting function (3.7). In Figure 2a, a noisy data set of the function
$y = x - \sin^3(2\pi x^3)\cos(2\pi x^3)\exp(x^4)$ was fitted by locally weighted regression with a
globally optimized, i.e., constant, k. In the left half of the plot, the regression starts to fit
noise  because  k  had  to  be  rather  small  to  fit  the  high  frequency  regions  on  the  right  half  of 
the plot. The prediction intervals, which will be introduced below, indicate high uncertainty
in several places. To avoid such undesirable behavior, a local optimization criterion
is  needed.  Standard  linear  regression  analysis  provides  a  series  of  well-defined  statistical 
tools  to  assess  the  quality  of  fits,  such  as  coefficients  of  determination,  t-tests,  F-test,  the 
PRESS-statistic, Mallows' $C_p$-test [23], confidence intervals, prediction intervals, and
many  more  (e.g.,  [29]).  These  tools  can  be  adapted  to  locally  weighted  regression.  We  do 
not  want  to  discuss  all  possible  available  statistics  here  but  rather  focus  on  two  that  have 
proved  to  be  useful. 
Cross validation has a relative in linear regression analysis, the PRESS residual error.
The PRESS statistic performs leave-one-out cross validation computationally very efficiently
by not requiring recalculation of the regression parameters for every excluded point. Table
1 in Appendix A shows how the PRESS residual can be expressed as a mean squared cross
validation error. In Figure 2b, the same data as in Figure 2a was fitted by adjusting
k to minimize this error at each query point. The outcome is much smoother than that
of global cross validation, and also the prediction intervals are narrower. It should be noted
that  the  extrapolation  properties  on  both  sides  of  the  graph  are  quite  appropriate  (compared 
to  the  known  underlying  function),  in  comparison  to  Figure  2a  and  Figure  2c. 
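
To illustrate how leave-one-out residuals can be obtained without refitting, the Python sketch below uses the standard leverage identity for ridge-regularized weighted least squares: the residual of point i when it is left out equals its ordinary residual divided by one minus its leverage. Table 1 of Appendix A is not reproduced in this section, so the weighted mean used in `local_mse_cross`, as well as all function names and defaults, are assumptions of this example rather than the definitions of this paper.

```python
import numpy as np

def press_residuals(X, y, x_q, k, ridge=1e-6):
    """Leave-one-out residuals of a Gaussian-weighted local linear fit around x_q."""
    m, n = X.shape
    w = np.exp(-np.sum((X - x_q) ** 2, axis=1) / (2.0 * k ** 2))
    Xa = np.hstack([X, np.ones((m, 1))])          # inputs with constant column
    Z = Xa * w[:, None]                            # weighted regression matrix
    P = np.linalg.inv(Z.T @ Z + ridge * np.eye(n + 1))
    beta = P @ (Z.T @ (w * y))
    r = y - Xa @ beta                              # ordinary residuals
    h = np.einsum("ij,jk,ik->i", Z, P, Z)          # leverage of each weighted row
    return r / (1.0 - h), w                        # PRESS residuals, weights

def local_mse_cross(X, y, x_q, k, ridge=1e-6):
    """Weighted mean squared cross validation error (assumed normalization)."""
    e, w = press_residuals(X, y, x_q, k, ridge)
    return np.sum((w * e) ** 2) / np.sum(w ** 2)

def choose_k(X, y, x_q, candidates, ridge=1e-6):
    """Pick the kernel width from a candidate list that minimizes the local error."""
    return min(candidates, key=lambda k: local_mse_cross(X, y, x_q, k, ridge))
```

A simple search over a candidate list is used here for clarity; a gradient-based minimization over k, as mentioned above for the global case, would be an alternative.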


Figure  3:  Influence  of  outliers  on  LWR;  (a)  no  outlier  removal,  (b)  with  outlier  removed 

Prediction intervals are expected bounds of the prediction error at a query point
$\mathbf{x}_q$. Table 1 gives the appropriate definition for LWR; its derivation can be found in most
text books on regression analysis (e.g., [29]). Besides using the intervals to assess the confidence
in the fit at a certain point, they provide another optimization measure. Figure 2c
demonstrates  the  result  when  applying  this  statistic  for  optimizing  k  at  each  query  point. 
Again, the fitted curve is significantly smoother than the global cross validation fit. A
rather interesting and also typical effect happens at the very right end of the plot. When
starting to extrapolate, the prediction intervals suddenly favor a global regression instead of
the local regression, i.e., the k was chosen to be rather large. It turns out that in local optimization
one always finds a competition between local and global regression. But sudden
jumps  from  one  mode  into  the  other  take  place  only  when  the  prediction  intervals  are  so 
large  that  the  data  is  not  trustworthy  anyway. 


Assessing  The  Quality  of  the  Local  Model 

Both the local cross validation error and the prediction interval $I_q$ may serve to
assess the quality of the local fit:


The factor c makes $Q_{fit}$ dimensionless and normalizes it with respect to some user defined
quantity. In our applications, we usually preferred $Q_{fit}$ based on the prediction intervals,
which  is  the  more  conservative  assessment. 


Dealing  with  Outliers 

Linear  regression  analysis  is  not  robust  with  respect  to  outliers.  This  also  holds  for  locally 
weighted regression, although the influence of outliers will not be noticed unless the outliers
lie close enough to a query point. In Figure 3a we added three outliers to the test data
of Figure 2 to demonstrate this effect; the charts in Figure 3 should be compared to Figure
2c. [27] applied the median absolute deviation procedure from robust statistics [18] to
globally remove outliers in LWR. We would like to localize our criterion for outlier removal.
The PRESS statistic can be modified to serve as an outlier detector in LWR. For
this, we need the standardized individual PRESS residual $e_i$ (see Table 1, Appendix A).
This measure has zero mean and unit variance. If, for a given data point $\mathbf{x}_i$, it deviates
from zero more than a certain threshold, the point can be called an outlier. A conservative
threshold would be 1.96, discarding all points lying outside the 95% area of the normal distribution.
In our applications, we used 2.57, cutting off all data outside the 99% area of the
normal distribution. As can be seen in Figure 3b, the effect of outliers is reduced.
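
A local outlier test along these lines might be sketched as follows in Python. Since Table 1 of Appendix A is not part of this section, the standardization used here is the textbook one for PRESS residuals, adapted to weighted data: the weighted residual is divided by the local noise estimate and by the square root of one minus the leverage. The effective number of data points (sum of squared weights), the effective number of parameters (sum of leverages), and the name `local_outlier_flags` are assumptions of this example.

```python
import numpy as np

def local_outlier_flags(X, y, x_q, k, threshold=2.57, ridge=1e-6):
    """Flag stored points whose standardized PRESS residual exceeds the threshold."""
    m, n = X.shape
    w = np.exp(-np.sum((X - x_q) ** 2, axis=1) / (2.0 * k ** 2))
    Xa = np.hstack([X, np.ones((m, 1))])
    Z = Xa * w[:, None]
    P = np.linalg.inv(Z.T @ Z + ridge * np.eye(n + 1))
    beta = P @ (Z.T @ (w * y))
    r_w = w * (y - Xa @ beta)                  # weighted residuals
    h = np.einsum("ij,jk,ik->i", Z, P, Z)      # leverages of the weighted rows
    n_eff = np.sum(w ** 2)                     # effective number of data points
    p_eff = np.sum(h)                          # effective number of parameters
    # Local noise estimate; needs enough weighted data so that n_eff > p_eff.
    s = np.sqrt(np.sum(r_w ** 2) / (n_eff - p_eff))
    e_std = r_w / (s * np.sqrt(1.0 - h))       # standardized PRESS residuals
    return np.abs(e_std) > threshold, e_std
```

Points far from the query receive small weights and therefore small standardized residuals, so only outliers that actually influence the local fit are flagged.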

The  Shifting  Setpoint  Exploration  Algorithm 

Learning  algorithms  which  assume  no  a  priori  structure  of  the  world  often  face  the  problem 
of  sparse  data  in  high  dimensional  spaces.  Random  exploration  in  order  to  build  models  of 
such  worlds  will  take  a  very  long  time.  Random  exploration  in  an  unknown  world  may  also 
cause the system to enter unsafe or costly regions of operation. We want to develop an exploration
algorithm which explicitly deals with such problems.


The  shifting  setpoint  algorithm  (SSA)  attempts  to  decompose  the  control  problem 
into  two  separate  control  tasks  on  different  time  scales.  At  the  fast  time  scale,  it  acts  as  a 
nonlinear  regulator  by  trying  to  keep  the  controlled  system  at  some  chosen  setpoints.  On  a 
slower  time  scale,  the  setpoints  are  shifted  to  accomplish  a  desired  goal.  The  SSA  tries  to 
explore  the  world  by  going  to  the  fringes  of  its  data  support  in  the  direction  of  the  goal.  It 
sets  the  setpoints  in  the  fringes  until  statistically  sufficient  data  has  been  collected  to  make 
a  further  step  towards  the  goal.  In  this  way  the  SSA  builds  a  narrow  tube  of  data  support  in 
which  it  knows  the  world.  This  data  can  be  used  by  more  sophisticated  control  algorithms 
for  planning  or  further  exploration. 


We  want  to  graphically  illustrate  the  algorithm  in  a  simple  example  of  a  mountain  car 
(Figure 4) [26]. The task of the car is to drive at a given constant horizontal speed
from the left to the right of the picture. This speed need not be met precisely; the car should
also minimize its fuel consumption. Initially, the car knows nothing about the world and
cannot look ahead, but it has noisy feedback of its position and velocity. Commands,
which correspond to the thrust F of the motor, can be generated at 5 Hz.

The mountain car starts at its start point with one arbitrary initial action for the first
time step, then it brakes and starts all over again, assuming the system can be reset somehow.
The discrete one step dynamics of the car are modeled by an LWR forward model:

$$\hat{\mathbf{x}}_{next} = \hat{\mathbf{f}}(\mathbf{x}_{current}, F), \quad \text{where } \mathbf{x} = (x, \dot{x})^T. \tag{4.1}$$


After a few trials, the SSA searches the data in memory for the point
$(x_{current}^T, F, x_{next}^T)^T_{best}$ whose outcome $x_{next}$ can be predicted with the
smallest local confidence interval. Note that this does not imply that
$\|x_{next} - \hat{x}_{next}\|$ is the smallest, since we have noise in the data. This best
point is declared the setpoint of this stage:

$(x_{S,in}^T, F_S, x_{S,out}^T)^T = (x_{current}^T, F, x_{next}^T)^T_{best}$  (4.2)

and its local linear model results from a corresponding LWR lookup:

$\hat{x}_{S,out} = \hat{f}(x_{S,in}, F_S) = A x_{S,in} + B F_S + c$.  (4.3)
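
As a rough illustration of such a lookup, the following Python sketch fits a local linear
model around a query point for a scalar input and a scalar output. It is a simplification
under stated assumptions, not the implementation used on the robot: the Gaussian kernel, the
`bandwidth` parameter, and the name `lwr_lookup` are hypothetical, and the forward model of
the mountain car has a vector-valued input. The returned slope plays the role of $A$ and $B$
in (4.3), and the intercept the role of $c$.

```python
import math


def lwr_lookup(xs, ys, query, bandwidth):
    """Fit a line to (xs, ys), weighting points by closeness to `query`.

    Returns (prediction_at_query, slope, intercept) of the local model.
    """
    # Assumed kernel: Gaussian in the distance to the query point.
    weights = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    # Weighted means of input and output.
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    # Weighted least squares with a single input has a closed form.
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * query + intercept, slope, intercept


# Memory of experiences sampled from a nonlinear function (here: sin).
memory_x = [i * 0.1 - 2.0 for i in range(41)]
memory_y = [math.sin(x) for x in memory_x]
prediction, slope, intercept = lwr_lookup(memory_x, memory_y, query=0.0, bandwidth=0.2)
```

Because the fit is recomputed for every query from the stored data, the bandwidth can be
changed from one lookup to the next without any retraining.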

Based on this linear model, an optimal LQ controller (e.g., [13]) can be constructed by
minimizing the cost:

$J = \sum_{i=1}^{\infty} \left( (x_i - x_{S,in})^T Q (x_i - x_{S,in}) + r (F_i - F_S)^2 \right)$  (4.4)

of the regulator problem:

$(x_{i+1} - x_{S,out}) = A (x_i - x_{S,in}) + B (F_i - F_S)$,  (4.5)

where $Q$ and $r$ are weight factors in matrix or scalar form, respectively. Solving this
problem results in the control law:

$F^* = -K (x_{current} - x_{S,in}) + F_S$.  (4.6)
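
To make the structure of this control law concrete, here is a minimal Python sketch for a
scalar state and a scalar command. It assumes an infinite-horizon cost and obtains the gain
by iterating the scalar discrete-time Riccati recursion; the function names `lq_gain` and
`lq_command`, the iteration count, and the numerical values are hypothetical and chosen only
for illustration. The mountain car itself has a two-dimensional state, for which the same
recursion would be written with matrices.

```python
def lq_gain(a, b, q, r, iterations=500):
    """Steady-state LQ gain for x[i+1] = a*x[i] + b*u[i], cost sum(q*x^2 + r*u^2)."""
    p = q  # initial guess for the Riccati variable
    for _ in range(iterations):
        # Scalar discrete-time Riccati recursion.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def lq_command(x_current, x_set_in, f_set, gain):
    """Control law of the form (4.6): feedback on the deviation plus setpoint command."""
    return -gain * (x_current - x_set_in) + f_set


gain = lq_gain(a=1.0, b=1.0, q=1.0, r=1.0)
command = lq_command(x_current=1.2, x_set_in=1.0, f_set=0.3, gain=gain)
```

Raising `r` relative to `q` lowers the gain, which corresponds to spending less command
effort at the price of slower error correction.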

$F^*$ is the optimal command under the cost $J$ to go from the current state to the setpoint
$x_S$ at this stage. This does not mean that the mountain car will actually reach $x_{S,out}$
after applying $F^*$; the optimal control framework only guarantees a step towards the goal
which reduces the magnitude of the value function. In the given problem it will trade speed
accuracy for fuel consumption; the compromise between the two factors is reflected in the
choice of $Q$ and $r$. After these calculations, the mountain car has learned one controlled
action for the first time step. However, since the initial action was chosen arbitrarily,
$x_{S,out}$ will be significantly away from the desired speed $\dot{x}_{desired}$. A reduction
of this error is achieved as follows. First, the SSA repeatedly performs one-step actions
with the LQ controller (which is updated with every new data point) until sufficient data has
been collected to reduce the size of the prediction intervals of LWR lookups for
$(x_{S,in}^T, F_S)^T$ (4.3) below a certain threshold. Then it shifts the setpoint towards the
goal according to the procedure:


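1) calculate the error of the predicted output state:

$err_{S,out} = x_{desired} - x_{S,out}$

2) take the derivative of the error with respect to the command $F_S$ from the local
   linear model (4.3):

$\frac{\partial err_{S,out}}{\partial F_S} = -\frac{\partial x_{S,out}}{\partial F_S} = -B$  (4.7)

   and calculate a correction $\Delta F_S$ from solving:

$-B \Delta F_S = \alpha \, err_{S,out}$,  (4.8)

   e.g., by singular value decomposition [31]; $\alpha \in [0,1]$ determines how much of
   the error should be compensated for in one step.

3) update $F_S$: $F_S = F_S - \Delta F_S$ and calculate the new $x_{S,out}$ with LWR (4.3).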

4)  assess  the  fit  for  the  updated  setpoint  with  prediction  intervals.  If  the  quality  is 
above  a  certain  threshold,  continue  with  1),  otherwise  terminate  shifting. 
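
The four steps above can be sketched in Python for a scalar command and a scalar output. This
is an illustration under assumptions, not the authors' code: `lookup` is a hypothetical
stand-in for an LWR lookup that returns the predicted output, the local derivative $B$, and a
quality-of-fit value, and `toy_lookup` is an invented world used only to exercise the loop.
The sketch rejects a candidate setpoint whose fit quality is too low; whether the last
candidate is kept or rejected is a design choice that the text leaves open.

```python
def shift_setpoint(f_set, x_desired, lookup, alpha=0.5, quality_threshold=0.5, max_steps=100):
    """Shift the setpoint command towards the goal while the model is trusted."""
    x_out, b, _ = lookup(f_set)
    for _ in range(max_steps):
        err = x_desired - x_out                 # step 1: error of predicted output
        delta_f = -alpha * err / b              # step 2: solve -B * dF = alpha * err
        candidate = f_set - delta_f             # step 3: update the command
        x_new, b_new, quality = lookup(candidate)
        if quality < quality_threshold:         # step 4: stop when data support is poor
            break
        f_set, x_out, b = candidate, x_new, b_new
    return f_set, x_out


def toy_lookup(f):
    """Hypothetical model: output 2*f, derivative 2, quality fading away from f = 1."""
    return 2.0 * f, 2.0, 1.0 - abs(f - 1.0)


final_command, final_output = shift_setpoint(1.0, 4.0, toy_lookup)
```

In this toy setting the setpoint moves part of the way towards the goal and then stops where
the assumed data support ends, which mirrors the behavior described in the text.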

In this way, the output state of the setpoint shifts towards the goal until the data
support falls below a threshold. Now the mountain car performs several new trials with the
new setpoint and the correspondingly updated LQ controller. Once the quality-of-fit
statistics rise above a threshold, the setpoint can be shifted again. As soon as the first
stage’s setpoint reduces the error $x_{desired} - x_{S,out}$ to become close enough to zero,
a new stage is created and the mountain car tries to move one step further in its world. The
entire procedure is repeated for each new stage until the car knows how to move across the
landscape along its line of setpoints with the associated LQ controllers. Figure 4b and
Figure 4c show the thin band of data which the algorithm collected in state space and
position-action space. These two pictures together form a narrow tube of knowledge in the
input space of the forward model. The car never tried more than one exploration step into
unknown territory and thus increased its probability of being safe to a high level.

Figure 4: The mountain car: (a) landscape across which the car has to drive at a constant
velocity of 0.8 m/s; (b) contour plot of data density in phase space as generated by using
multistage SSA; (c) contour plot of data density in position-action space


During the times when the setpoint statistics indicate insufficient data support to
continue shifting, data collection is left to the randomness of the task dynamics. Thus the
reduction of parameter uncertainty of the setpoint’s local model also depends on this
stochastic process. In order to identify the local model correctly, the stochastic process
must provide data in all dimensions of the input and output space. If not, the regression
problem (3.1a) may be ill-conditioned, resulting in bad estimates of the local model. Such
situations were addressed by Fel’dbaum [17] as the dual control problem. In his formulations,
the optimal command tries to minimize the cost and the uncertainty at the same time. So far,
only expensive numerical solutions based on dynamic programming have been found to this
problem [4, 7]. As an inelegant but effective way out, we add some small amount of random
noise to the command $F^*$. The next section will demonstrate the importance of this measure.
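
Why a little command noise helps can be seen in a small Python sketch. It assumes a scalar
state, a purely linear feedback command, and zero-mean Gaussian dither; the gain, the noise
levels, and the helper name `excitation` are hypothetical. Without dither the command is an
exact multiple of the state, so a regression on both inputs cannot separate their effects;
with dither the two inputs are no longer collinear.

```python
import random


def excitation(states, commands):
    """Return 1 - (squared correlation about zero) of states and commands.

    A value near 0 means the two regressors are collinear, so a regression
    on both of them is ill-conditioned.
    """
    sxx = sum(x * x for x in states)
    sff = sum(f * f for f in commands)
    sxf = sum(x * f for x, f in zip(states, commands))
    return (sxx * sff - sxf * sxf) / (sxx * sff)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
states = [rng.gauss(0.0, 1.0) for _ in range(50)]
gain = 0.6  # hypothetical feedback gain
pure_commands = [-gain * x for x in states]
noisy_commands = [f + rng.gauss(0.0, 0.1) for f in pure_commands]

print(excitation(states, pure_commands))   # essentially zero: collinear inputs
print(excitation(states, noisy_commands))  # clearly positive: inputs can be separated
```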

Exploration has many facets. Depending on the task to be solved, random exploration,
exploration towards unknown state space regions, and exploration towards reduction of
uncertainty, etc., have been suggested [39]. The SSA exploration algorithm is goal directed
and uncertainty driven, on the premise of not daring any aggressive exploration outside the
current data support. It is targeted at working in high dimensional environments where
aggressive exploration would spend too much time in inappropriate and possibly dangerous
regions. It is well suited for a real machine for which experimentation is time consuming.
The SSA requires the existence of explicit goals. However, it is not always necessary to
know these goals in advance; they may instead develop out of the task definition, as will be
shown in the next section. The SSA should be generally applicable to problems which allow a
decomposition into a static exploitation and a slowly moving exploration time scale, which
have once-differentiable forward dynamics, and where the noise does not exceed the
capabilities of the LQ controllers.

A  System  For  Learning  Experiments:  Robot  Juggling 

We have constructed a system for experiments in real-time motor learning [40]. The task is
a juggling task known as “devil sticking”. A center stick is batted back and forth between
two hand sticks (Figure 5a). Figures 5b, c show a sketch and photograph of our devil sticking
robot. The juggling robot uses motor 1 and motor 2 to perform planar devil sticking. Hand
sticks with springs and dampers are mounted on the robot to implement a passive catch; the
center stick does not bounce when it hits the hand stick and requires an active throwing
motion by the robot. For the time being, the problem is simplified by the center stick being
constrained by a boom to move on the surface of a sphere (Figure 5b), and motor 3 is not
used. For moderate amplitudes these movements are approximately planar. The boom also
provides a way to measure the current state of the center stick. The task state is the
predicted location at which the ballistic flight of the center stick intersects with the
hand stick held in an arbitrary but fixed nominal position $(x_{h,nominal}, y_{h,nominal})^T$.
We chose this nominal position to be the hand stick position of the “upright” robot as shown
in Figure 5b. As soon as the center stick does not touch the throwing hand stick anymore,
standard ballistics equations for the flight of the center stick are used to map flight
trajectory measurements $(x(t), y(t), \theta(t))$ into the 5-dimensional estimated task state
vector, i.e., the impact state with the other hand stick held at
$(x_{h,nominal}, y_{h,nominal})^T$:

$x = (p, \theta, \dot{x}, \dot{y}, \dot{\theta})^T$.  (5.1)

$p$ is the distance of the devil stick’s center of mass to the impact point between hand
stick and devil stick (Figure 5b). The task command is given by a displacement $(x_h, y_h)^T$
of the hand stick from the nominal position $(x_{h,nominal}, y_{h,nominal})^T$, a center stick
angular velocity threshold $\dot{\theta}_t$ to trigger the start of a throwing motion, and a
throw velocity vector $(v_x, v_y)^T$ of the hand stick, measured at the point where the hand
stick is attached to the robot:

$u = (x_h, y_h, \dot{\theta}_t, v_x, v_y)^T$.  (5.2)

The dynamics of throwing the devil stick are thus parameterized by 5 state and 5 task
command variables, resulting in a 10/5-dimensional input/output model for each hand. Every
time the robot catches and throws the devil stick it generates an experience vector of the
form:

$(x_k^T, u_k^T, x_{k+1}^T)^T$,  (5.3)

where $x_k$ is the current state, $u_k$ is the action performed by the robot, and $x_{k+1}$
is the state of the center stick that results. Initially we explored learning an inverse
model of the task, using nonlinear “deadbeat” control to eliminate all error on each hit.
Each hand had its own inverse model of the form:

$u_k = \hat{f}^{-1}(x_k, x_{k+1})$.  (5.4)

Before each hit, the system looked up a command with the expected impact state of the
devil stick and the desired state:

$u_k = \hat{f}^{-1}(\hat{x}_k, x_d)$.  (5.5)

Inverse model learning was successfully used to train the system to perform the devil
sticking task. Juggling runs of up to 100 hits were achieved. The system incorporated new
data in real time, and used databases of several hundred hits. Lookups took less than 15
milliseconds, and therefore several lookups could be performed before the end of the flight
of the center stick. Later queries incorporated more measurements of the flight of the center
stick and therefore more accurate predictions of the state $\hat{x}_k$ of the task. However,
the system required substantial structure in the initial training to achieve this
performance. The system was started with a default command that was appropriate for open
loop performance of the task. Each control parameter was varied systematically to explore
the space near the default command. A global linear model was made of this initial data, and
a linear controller based on this model was used to generate an initial training set for the
memory-based system (approximately 100 hits). Learning with no initial data was not possible.


Figure 5: (a) an illustration of devil sticking; (b) sketch of our devil sticking robot: the
flow of force from each motor into the robot is indicated by different shadings of the robot
links, and a position change due to an application of motor 1 or motor 2, respectively, is
indicated in the small sketches; (c) photograph of robot

We also experimented with learning based on both inverse and forward models. After a
command is generated by the inverse model, it can be evaluated using a memory-based forward
model with the same data:

$\hat{x}_{k+1} = \hat{f}(x_k, u_k)$.  (5.6)

Because it produces a local linear model, the LWR procedure generates estimates of the
derivatives of the forward model with respect to the commands as part of the estimated
parameter vector $\beta$ (analogous to 2.18 or 4.3). These derivatives can be used to find a
correction to the command vector that reduces errors in the predicted outcome based on the
forward model:

$\frac{\partial \hat{f}}{\partial u} \Delta u_k = \hat{x}_{k+1} - x_d$,  (5.7)

where the goal state $x_d$ was calculated off-line from a comparison with human juggling.
The pseudo-inverse of the matrix $\partial \hat{f} / \partial u$ is used to solve the above
equation for $\Delta u_k$ in order to handle situations in which the matrix is singular or a
different number of commands and states exists (which does not apply for devil sticking).
The process of command refinement can be repeated until the forward model no longer produces
accurate predictions of the outcome. This will happen when the query to the forward model
requires significant extrapolation from the current database.
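
A minimal Python sketch of one refinement step is given below for the special case of a
single output and several commands, where the pseudo-inverse of the one-row derivative matrix
has a simple closed form. It assumes that the correction solves "derivative times correction
equals predicted minus desired outcome" and is then subtracted from the command; the name
`refine_command` and the linear toy model in the usage lines are hypothetical. Devil sticking
has five outputs and five commands, for which a general pseudo-inverse routine would be
needed.

```python
def refine_command(u, jacobian_row, predicted, desired):
    """One correction step using the pseudo-inverse of a single-row Jacobian."""
    error = predicted - desired
    norm_sq = sum(j * j for j in jacobian_row)
    if norm_sq == 0.0:
        # The pseudo-inverse of a zero row is zero: no correction is possible.
        return list(u)
    # Minimum-norm solution of jacobian_row . delta_u = error.
    delta_u = [j * error / norm_sq for j in jacobian_row]
    return [ui - di for ui, di in zip(u, delta_u)]


# Toy forward model (assumed): outcome = 3*u0 + 4*u1, so the Jacobian row is [3, 4].
u_old = [0.0, 0.0]
u_new = refine_command(u_old, [3.0, 4.0], predicted=0.0, desired=5.0)
outcome = 3.0 * u_new[0] + 4.0 * u_new[1]
```

For a model that is exactly linear, as in this toy case, one step removes the whole predicted
error; for a locally linear model the step is only trustworthy near the query point.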

We investigated this method for incremental learning of devil sticking in simulations
whose dynamics were adopted from the real machine. The outcome, however, did not meet
expectations: without sufficient initial data around the setpoint, the algorithm did not
work. Two main reasons can be held responsible:

i) Similar to the pure inverse model approach, the inverse-forward model acts as a
   one-step deadbeat controller. One-step deadbeat control applies large commands to
   correct for deviations from the setpoint. In the presence of errors in the model,
   this is detrimental since it magnifies the model errors. Additionally, the workspace
   bounds and command bounds of our devil sticking robot limit the size of the commands.

ii) Due to the nonlinearities in the dynamics of the robot, the 10-dimensional input
   space of the forward model suffers from the first symptoms of Bellman’s “curse of
   dimensionality”. Error reduction as described in (5.7) only works if sufficient data
   exists at the query sites. The inevitable model errors will make the robot explore
   randomly, leading to dispersed data, giving little chance for model improvements.
   Imagine we had to place data in a (hyper-)cube of normalized edge length 0.1. A
   3-dimensional input space has $10^3$ such cubes, leaving some probability to finally
   arrive at the goal. A 10-dimensional state space, however, has $10^{10}$ such cubes, a
   prohibitive number for random exploration.
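
The arithmetic behind this comparison can be written out in a few lines of Python. The helper
name `cell_count` is hypothetical; it simply counts how many cubes of a given edge length
tile a unit hypercube, assuming the edge length divides 1 evenly.

```python
def cell_count(dimensions, edge=0.1):
    """Number of hypercubes of side `edge` that tile a unit hypercube."""
    cells_per_axis = round(1.0 / edge)
    return cells_per_axis ** dimensions


for d in (3, 10):
    print(d, cell_count(d))
```

Going from 3 to 10 dimensions multiplies the number of cells by ten million, so uniformly
scattered data becomes correspondingly sparse.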

Thus,  two  ingredients  had  to  be  added  to  the  devil  sticking  controller: 

a) Control must start as soon as possible with the primary goal to increase the data
   density in the current region of the state-action space, and the secondary goal to
   arrive at the desired goal state.

Figure 6: Abstract illustration of how the SSA algorithm collects data in space: a) sparse
data after the first few hits; b) high local data density due to local control in this
region; c) increased data density on the way to the goals due to shifting of the setpoints;
d) ridge of data density after the goal was reached

Both requirements are fulfilled by the shifting setpoint algorithm (SSA). Applied to devil
sticking, the SSA proceeds as follows:

(1) Regardless of the poor juggling quality of the robot (i.e., at most two or three hits
    per trial), the SSA makes the robot repeat these initial actions with small random
    perturbations until a cloud of data has been collected somewhere in state-action space
    of each hand. An abstract illustration for this is given in Figure 6a to 6b.


(2) Each point in the data cloud of each hand is used as a candidate for a setpoint of the
    corresponding hand by trying to predict its output from its input with LWR. The point
    achieving the narrowest local confidence interval becomes the setpoint of the hand and
    an LQ controller is calculated for its local linear model. By means of these
    controllers, the amount of data around the setpoints can quickly and rather accurately
    be increased until the quality of the local models exceeds a certain statistical
    threshold.

(3) At this point, the setpoints are gradually shifted towards the goal setpoints until the
    data support of the local models falls below a statistical value. Shifting occurs for
    both input state and output state of the setpoints (cf. Eq. (4.2)). After shifting, the
    kernel $k$ (cf. Eq. (3.7)) is optimized by minimizing the local cross validation error.
    (In Figure 6, the goal setpoints are given explicitly, but they actually develop
    automatically from the requirement to throw the devil stick increasingly close to a
    place in which the other hand has data support, i.e.,
    $x_{S,out,desired,left} = x_{S,in,right}$, and vice versa for the other hand.)

(4) The SSA repeats itself by collecting data in the new regions of the workspace until
    the setpoints can be shifted again (Figure 6c). The procedure terminates by reaching
    the goal, leaving a (hyper-)ridge of data in space (Figure 6d).

The LQ controllers play a crucial role for devil sticking. Although we statistically
exploit data thoroughly, it is nevertheless hard to build good local linear models in the
high dimensional spaces, particularly at the beginning of learning. LQ control has useful
robustness even if the underlying linear models are imprecise.

We tested the SSA in a noise corrupted simulation and on the real robot. Learning curves
are given in Figure 7. The learning curves are typical for the given problem. It takes
roughly 40 trials before the setpoint of each hand has moved close enough to the other
hand’s setpoint. For the simulation (Figure 7a) a breakthrough occurs and the robot rarely
loses the devil stick after that. In Figure 7b, the real robot learning curve is shown. The
real robot takes more trials to achieve longer juggling runs, and its performance is not very
consistent. This was due to the fact that the stochasticities of the robot did not sample
the full state space sufficiently well during the data collection phases of the SSA. As
pointed out in the dual control paragraph of the SSA section, we therefore added some random
noise to the controls generated by the LQ controllers. Figure 7c shows the remarkable
improvement in performance. On average, human beings need roughly a week of practicing one
hour a day before they learn to juggle the devil stick. With respect to this, the robot
learned very quickly. But the stability of our controllers is not global so far and will
require future work.

Figure 7: Learning curves of devil sticking using the SSA algorithm, plotted against trial
number: (a) simulation results (individual trials were stopped after 200 hits were reached);
(b) real robot results; (c) real robot results with a small amount of random noise added to
the LQ controller commands


Discussion 

In this paper we adopted a nonparametric approach to learning control. By means of locally
weighted regression we built models of the world first, and exploited the models
subsequently with statistical methods and algorithms from optimal control to design
controllers. Despite the computational complexity of these methods, we demonstrated the
usefulness of our algorithms in a real-time implementation of robot learning.

Using models for control according to the certainty equivalence principle is nothing new
and has been supported by many researchers in recent years (e.g., [3, 38, 22, 24]). Using
memory-based or nonparametric models, however, has only recently received increasing
interest. One of the advantages of memory-based modeling lies in the least commitment
strategy which is associated with it. Since all data is kept in memory, a lookup can be
optimized with respect to the few open architectural parameters. Parametric approaches do
not have this ability if they discard their training data; if they retained all the training
data they would essentially become memory-based. As we demonstrated in our LWR approach to
nonparametric modeling, several established statistical methods may be adopted to assess the
quality of a model. These statistics form the backbone of the SSA exploration algorithm. So
far we have only examined some of the most obvious statistical tools which directly relate
to regression analysis. Many other methods may be suitable as well, particularly in a
Bayesian framework.

Training a memory-based model is computationally inexpensive, as the data is simply stored
in a memory. Training a nonlinear parametric model typically requires an iterative search
for the appropriate parameters. Examples of iterative search are the various gradient
descent techniques used to train neural network models (e.g., [20]). Lookup, or evaluation
of a memory-based model, is computationally expensive, as described in this paper. Lookup
for a nonlinear parametric model is often relatively inexpensive. If there is a situation in
which a fixed set of training data is available, and there will be many queries to the model
after the training data is processed, then it makes sense to use a nonlinear parametric
model. However, if there is a continuous stream of new training data intermixed with
queries, as there typically is in many motor learning problems, it may be less expensive to
train and query a memory-based model than it is to train and query a nonlinear parametric
model.

A question that often arises with memory-based models is the effect of memory limitations.
We have not yet needed to address this issue in our experiments. However, we plan to explore
how memory use can be minimized based on several methods. One approach is to only store
surprises. The system would try to predict the outputs of a data point before trying to
store it. If the prediction is good, it is not necessary to store the point. Another
approach is to forget data points. Points can be forgotten or removed from the database
based on age, proximity to queries, or other criteria. Because memory-based learning retains
the original training data, forgetting can be explicitly controlled.
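
Both ideas can be sketched in a few lines of Python. The sketch is hypothetical throughout:
a nearest-neighbor lookup stands in for the LWR prediction, the `tolerance` that defines a
"surprise" is an invented parameter, and forgetting is shown only for the age criterion.

```python
def nearest_prediction(memory, x):
    """Hypothetical stand-in for a model lookup: output of the nearest stored point."""
    return min(memory, key=lambda point: abs(point[0] - x))[1]


def store_if_surprising(memory, x, y, tolerance):
    """Append (x, y) only if the current memory predicts y poorly."""
    if not memory or abs(nearest_prediction(memory, x) - y) > tolerance:
        memory.append((x, y))
        return True
    return False


def forget_oldest(memory, max_size):
    """Drop the oldest points so that at most `max_size` remain (age-based forgetting)."""
    del memory[:max(0, len(memory) - max_size)]


memory = []
for i in range(11):
    x = i / 10
    store_if_surprising(memory, x, 2.0 * x, tolerance=0.5)
stored_inputs = [x for x, _ in memory]
```

In this toy stream only a fraction of the eleven points is kept, because most of them are
already predicted within the tolerance by a stored neighbor.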

That computational complexity does not necessarily limit real-time applications was
demonstrated with our successful devil sticking robot. We are able to do lookups for
memory-based local models in less than 15 ms for a thousand data points modeling a 10 to 5
mapping, and we are able to build on-line LQ controllers in another 5 ms. The initial
shortcomings of our deadbeat inverse or inverse-forward model controllers are not due to the
LWR learning algorithm but rather to the inherent problems of this kind of control. As has
been pointed out by Jordan and Rumelhart [22], inverse models are not goal-directed and
perform data sampling in action and not state space. They do not establish a connection
between a certain sensation and a certain action but rather a connection between two
sensations. Hence, they do not learn from bad actions. A forward model overcomes these
problems. Pure forward model controllers, however, are still deadbeat controllers which try
to cancel the plant dynamics in one step. This results in large commands if the system
deviates only moderately from its desired goal and conflicts with the workspace bounds and
command bounds of our robot. Additionally, modeling errors are strongly amplified by
deadbeat control. Accurate data sampling, as it is necessary in high dimensional spaces,
thus becomes rather difficult.

Due to the statistical properties of locally weighted regression, a simple exploration
algorithm like the shifting setpoint algorithm is powerful enough to accomplish the desired
task. Deadbeat control was replaced by LQ control which naturally blends into the LWR
framework. By no means was the SSA algorithm intended to replace high-level controllers.
Indeed, it remains to be explored how far the chaining of individual LQ controllers is
actually robust, and whether an approach from trajectory optimization [13] would not be
more appropriate. In favor of the SSA algorithm stands its easy implementation for real-time
systems.

Two crucial prerequisites entered our explanations on robot learning. First, we assumed we
know the input/output representations of the task, and second, we were able to generate a
goal state for the SSA exploration. A good choice of a representation is crucial in order to
be able to accomplish the goal at all [33, 35], and we have very limited insight so far into
how to automate this part of the learning process. Of equal importance is a good choice of a
goal state. In devil sticking, the goal state developed out of the necessity that the left
and right hand have to cooperate. The initial action, however, which was given by the
experimenter, clearly determined in which ballpark the juggling pattern would lie. Certain
patterns of devil sticking are easier than others [34], and we picked an initial action that
we knew to be favorable. One part of our future work will address these issues in more
detail in that we search for good initial actions and strategies to approach a task [6].

Conclusions 

The paper demonstrated that a real robot can indeed learn a non-trivial task. As pointed
out above, by taking input/output representations and good learning goals as given, a large
portion of the task was already solved in advance. Solving the remaining problems became
practicable mainly because of the characteristics of the LWR learning method. The local
linear models that this algorithm generated at every query point allowed us to make use of
optimal control techniques which added useful robustness to the controllers. Since LWR is
memory-based, the local linear models could be optimized with respect to statistical
uncertainty measures. These measures also served as the basis of the SSA exploration
algorithm. Such statistical tools are not generally available in learning. LWR is
particularly suited to exploit statistics since it originates from a statistical method, and
we could thus easily assure compatibility of the statistics and the learning algorithm. As a
last point, LWR does not suffer from problems of interference when being trained on new
data. Interference means a degradation of performance in one part of the model when training
the model with data relevant for different parts. Such an effect could happen during SSA
shifting if a parametric learning method were applied. But since lookups with LWR are
affected only by a small cloud of data in the neighborhood of the query point, interference
problems are greatly reduced.

Our future work will focus on extending LWR model-based control to multistage problems in
the optimal control domain. Devil sticking should not only be stable within the validity of
the local linear controllers but rather exhibit global stability. This requires nonlinear
optimal control and planning techniques which we are currently exploring. Future work must
furthermore address how we could approach tasks in which complete measurements of the states
are not available, or what constitutes a state is not even known.
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Appendix  A 

The conversion from unweighted linear regression to locally weighted regression analysis
is done mostly by inserting the matrix W at the appropriate sites. The definitions of the
parameters n' and p', however, need some explanation. Imagine the weighting function is not
a soft-weighting function (e.g., a Gaussian) but rather a hard-weighting function clipping
off points beyond a certain threshold: $w_i = 1$ if the distance $d_i$ of point i from the
query point is below a threshold s, otherwise $w_i = 0$. Redefining n to n' accommodates
such a k-nearest neighbor weighting and transforms the n-point regression problem to a
k-point regression. If p stays unchanged, all statistics would correspond to unweighted
regression. Subtraction of p from n to calculate variances aims at achieving unbiased
estimators. For LWR, it is easily possible to find the case where n' < p. In the
above-mentioned k-nearest neighbor example, this would mean that we do not have sufficient
data support for the regression model. For soft-weighting functions, the problem cannot be
resolved so clearly; we could always imagine scaling up all weights by a constant factor so
that n' > p, which would not change the regression variables. LWR weights data with respect
to each other and not absolutely. Applying n' instead of n to calculate variances or mean
squared errors makes such measures invariant towards mere weight scaling; this can easily
be verified by setting $w_i \to w_i \cdot \mathrm{const}$. Defining p' in the given way
avoids problems when n' < p. The bias introduced this way should diminish with an
increasing number of data points in memory. One could argue in favor of neglecting p
entirely, but incorporating it in the given way makes the statistics stay on the more
pessimistic side, which seems reasonable.
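
The scale invariance claimed above can be checked numerically. The following is a minimal sketch, not the statistics code used on the robot. It assumes that n' is the sum of the squared weights, as Table 1 suggests, and it leaves out the p' correction entirely; the function names and the sample numbers are made up for illustration.

```python
def effective_n(weights):
    """n', taken here as the sum of squared weights (assumed definition)."""
    return sum(w * w for w in weights)


def weighted_mse(residuals, weights):
    """Mean squared weighted residual, normalized by n' rather than n."""
    total = sum((w * r) ** 2 for w, r in zip(weights, residuals))
    return total / effective_n(weights)


residuals = [0.5, -1.0, 2.0]          # made-up residuals of a local fit
weights = [1.0, 0.5, 0.25]            # made-up soft weights
scaled = [4.0 * w for w in weights]   # the same weights times a constant

# Both calls give the same value: the constant cancels between the
# weighted residual sum and n'.
print(weighted_mse(residuals, weights), weighted_mse(residuals, scaled))
```

With hard 0/1 weights, `effective_n` simply counts the points that survive the threshold, which is the k-point regression described above.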

The only statistics which see a direct influence of a weight scaling are the prediction
intervals and the standardized individual PRESS residuals. Restricting the weights to the
range [0,1] and requiring that the weight of a point with zero Euclidean distance from the
lookup point equals 1 resolves this problem. The dependence of the t-value on n'-p' in the
prediction intervals increases the t-value and thus the prediction intervals with diminishing
n'. The smallest value t may acquire is for n=n', i.e., unweighted regression. The
standardized individual PRESS residual is proportional to the magnitude of the i-th weight.
As we pointed out in Section x3, this measure has zero mean and unit variance. With
increasing distance from the current query point, it will be weighted down, i.e., the
likelihood of being an outlier diminishes even if the residual is rather large. As the
weights cannot be larger than 1, the standardized PRESS residual cannot assume larger values
than for unweighted regression.

It must be pointed out that statistics literature provides much more sophisticated and
mathematically exact statistics for locally weighted regression [19, 28, 14, 11]. However,
most of these measures require the estimation of Hessians and/or data densities which for
high dimensional problems are not readily adapted without numerical problems. Our LWR
statistics are used to tune fit parameters and need not give precise statistical assessments.


Table 1:

Unweighted Linear Regression: n
Locally Weighted Regression: we define $n' = \sum_{i=1}^{n} w_i^2$, with $w_i \in [0,1]$ and
$w_i = 1$ if $x_i = x_q$, where $x_q$ is the current query point

Abstract

This paper explores a memory-based approach to robot learning, using memory-based neural
networks to learn models of the task to be performed. Steinbuch and Taylor presented neural
network designs to explicitly store training data and do nearest neighbor lookup in the early
1960s. In this paper their nearest neighbor network is augmented with a local model network,
which fits a local model to a set of nearest neighbors. This network design is equivalent to a
statistical approach known as locally weighted regression, in which a local model is formed to
answer each query, using a weighted regression in which nearby points (similar experiences) are
weighted more than distant points (less relevant experiences).

The memory-based neural network architecture can represent smooth nonlinear functions,
yet has simple training rules with a single global optimum. The paper explains how an
appropriate distance metric or measure of similarity can be found, and how the distance metric
is used. We localize the architectural parameters of the approach, so that parameters such as
distance metrics are a function of the current query point instead of being global. The paper also
explains  how  irrelevant  input  variables  and  terms  in  the  local  model  are  detected.  Statistical 
tests  are  presented  for  when  a  local  model  is  good  enough  and  sampling  should  be  moved  to  a 
new  area.  Our  methods  explicitly  deal  with  the  case  where  prediction  accuracy  requirements 
exist  during  exploration.  By  gradually  shifting  a  center  of  exploration  and  controlling  the  speed 
of  the  shift  based  on  local  prediction  accuracy,  a  goal-directed  exploration  of  state  space  takes 
place  along  the  fringes  of  the  current  data  support  until  the  task  goal  is  achieved. 

We  illustrate  this  approach  by  describing  how  it  has  been  used  to  enable  a  robot  to  learn  a 
difficult  juggling  task. 


1  Introduction 

An important problem in motor learning is approximating a continuous function from samples of
the function's inputs and outputs. This paper explores a neural network architecture that simply
remembers experiences (samples) and builds a local model to answer any particular query (an input
for which the function's output is desired). Steinbuch (Steinbuch 1961, Steinbuch and Piske 1963)
and Taylor (Taylor 1959, Taylor 1960) independently proposed neural network designs that explicitly
remembered the training experiences and used a local representation to do nearest neighbor lookup.
They pointed out that this approach could be used for control. They used a layer of hidden units to
compute an inner product of each stored vector with the input vector. A winner-take-all circuit then
selected the hidden unit with the highest activation. This type of network can find nearest neighbors
or best matches using a Euclidean distance metric (Kazmierczak and Steinbuch 1963). In this paper
their nearest neighbor lookup network (which I will refer to as the memory network) is augmented
with a local model network, which fits a local model to a set of nearest neighbors.

The memory-based neural network design can represent smooth nonlinear functions, yet has
simple training rules with a single global optimum for building a local model in response to a query.
Our philosophy is to model complex continuous functions using simple local models. This approach
avoids the difficult problem of finding an appropriate structure for a global model and allows complex
nonlinear models to be identified (trained) quickly. A key idea is to form a training set for the local
model network after a query to be answered is known. This approach allows us to include in the
training set only relevant experiences (nearby samples), and to weight the experiences according to
their relevance to the query. The local model network, which may be a simple network architecture
such as a perceptron, forms a model of the portion of the function near the query point, much as
a Taylor series models a function in a neighborhood of a point. This local model is then used to
predict the output of the function, given the input. After answering the query, a new local model is
trained to answer the next query. This approach minimizes interference between old and new data,
and allows the range of generalization to depend on the density of the samples.

Currently we are using polynomials as the local models. Since the polynomial local models are
linear in the unknown parameters, we can estimate these parameters using a linear regression. We
use cross validation to choose an appropriate distance metric and weighting function, and to help find
irrelevant input variables and terms in the local model. In this approach cross validation is no more
computationally expensive than answering a query. This is quite different from a parametric neural
network, where a new network must be trained for each cross validation training set. We extend
this approach to give information about the reliability of the predictions and local linearizations
generated by locally weighted regression. This allows the robot to monitor its own skill level and
guide its exploratory behavior. The polynomial local models allow us to efficiently estimate local
linear models for different points in the state space. These local linear models are used in several
ways during learning.

We use several forms of indirect learning, where a model is learned and then control actions
are chosen based on the model, rather than direct learning, where appropriate control actions are
learned directly. Our starting point for modeling is that we do not know the structure or form of
the system to be controlled. We assume we do know what constitutes a state of the system, and
that we measure the complete state. Later papers will discuss how we could approach tasks in which
complete measurements of the state are not available, or what constitutes a state is not even known.

The learned models are multidimensional functions that are approximated from sampled data (the
previous experiences or attempts to perform the task). Goals for function approximation in robot
learning go beyond being able to represent the training set and generalize appropriately. The learned
models are used in a variety of ways to successfully execute the task. We would like the models to
incorporate the latest information. The models will be continuously updated with a stream of new
training data, so updating a model with new data should take a short period of time. There are
also time constraints on how long it can take to use a model to make a prediction. Because we are
interested in control methods that make use of local linearizations of the plant model, we want a
representation that can quickly compute a local linear model of the represented transformation. We
also need to be able to find first (and potentially second) derivatives of the learned function. We
would like to minimize the negative interference from learning new knowledge on previously stored
information. The ability to tell where in the input space the function is accurately approximated
is very useful. This is typically based on the local density of samples, and an estimate of the local
variance of the outputs. This ability is used in iterative use of the model to determine when to
terminate search and collect more data.


2  Locally  Weighted  Regression 

As the most generic approximator that satisfies many of these criteria, we are exploring a
memory-based learning technique called locally weighted regression (LWR) (Cleveland et al. 1988,
Atkeson 1990). A memory-based learning (MBL) system is trained by storing the training data in a
memory. This allows MBL systems to achieve real-time learning. MBL avoids interference between
new and old data by retaining and using all the data to answer each query. MBL approximates
complex functions using simple local models, as does a Taylor series. Examples of types of local
models include nearest neighbor, weighted average, and locally weighted regression (Figure 1). Each
of these local models combines points near to a query point to estimate the appropriate output.
Nearest neighbor local models simply choose the closest point and use its output value. Weighted
average local models average the outputs of nearby points, weighted inversely by their distance to
the query point. Locally weighted regression fits a surface to nearby points using a distance
weighted regression.
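
To make the first two local models concrete, here is a minimal one-dimensional sketch. It is an illustration only: the function names are hypothetical, and the Gaussian weighting with width `k` is one assumed choice of a weight that decreases with distance.

```python
import math


def nearest_neighbor(xs, ys, query):
    """Return the output stored with the point closest to the query."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - query))
    return ys[i]


def weighted_average(xs, ys, query, k=0.5):
    """Average the stored outputs, weighting nearby points more strongly."""
    ws = [math.exp(-((x - query) ** 2) / (2.0 * k * k)) for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)


xs = [0.0, 1.0, 2.0]       # made-up stored inputs
ys = [0.0, 10.0, 20.0]     # made-up stored outputs
print(nearest_neighbor(xs, ys, 0.9), weighted_average(xs, ys, 1.0))
```

Neither model fits a slope, so both flatten the function between samples; the regression-based local model discussed next removes that limitation at a higher cost per query.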

The locally weighted regression depends on parameters that are used to calculate a distance
metric and a weighting function, and to stabilize the solution to the regression. We will refer to
these parameters as “architectural parameters”. These parameters can be optimized automatically in a
local fashion using cross validation.

Locally weighted regression uses a relatively complex regression procedure to form the local model
and is thus more expensive than nearest neighbor and weighted average memory-based learning
procedures. For each query a new local model is formed. The rate at which local models can be
formed and evaluated limits the rate at which queries can be answered. This section describes how
locally weighted regression can be implemented in real time.


2.1  An example

As an example of modeling a function using locally weighted regression, we will consider a problem
from motor control and robotics, two-joint arm inverse dynamics. We will predict torques at the
shoulder, $\tau_1$, and elbow, $\tau_2$, on the basis of joint positions, $\theta_1$ and $\theta_2$,
joint velocities, $\dot\theta_1$ and $\dot\theta_2$, and joint accelerations, $\ddot\theta_1$ and
$\ddot\theta_2$. We use this example because we already know the idealized function
and will be able to assess how well the locally weighted regression procedure is doing and interpret
the parameters used to improve the fit. In an actual application a structured model (An, Atkeson,
and Hollerbach 1988, for example) would be used to fit the dynamics data, and the locally weighted
regression would be used to fit the errors (residuals) of the structured model.

Figure 1: Fits using different types of local models for three and five data points.

Suppose we are given a query point for which we want to predict the shoulder and elbow
torques. We will first show how an unweighted regression can be used to form a global model. Then
we will show how a weighted regression can be used to form a local model appropriate to answer this
particular query. For the purposes of this example we will assume a quadratic model is used in the
regression. In this dynamics example there are 28 terms in the quadratic model:

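$$
\begin{array}{l}
1,\ \theta_1,\ \theta_2,\ \dot\theta_1,\ \dot\theta_2,\ \ddot\theta_1,\ \ddot\theta_2, \\
\theta_1 * \theta_1,\ \theta_1 * \theta_2,\ \theta_1 * \dot\theta_1,\ \theta_1 * \dot\theta_2,\ \theta_1 * \ddot\theta_1,\ \theta_1 * \ddot\theta_2, \\
\theta_2 * \theta_2,\ \theta_2 * \dot\theta_1,\ \theta_2 * \dot\theta_2,\ \theta_2 * \ddot\theta_1,\ \theta_2 * \ddot\theta_2, \\
\dot\theta_1 * \dot\theta_1,\ \dot\theta_1 * \dot\theta_2,\ \dot\theta_1 * \ddot\theta_1,\ \dot\theta_1 * \ddot\theta_2, \\
\dot\theta_2 * \dot\theta_2,\ \dot\theta_2 * \ddot\theta_1,\ \dot\theta_2 * \ddot\theta_2, \\
\ddot\theta_1 * \ddot\theta_1,\ \ddot\theta_1 * \ddot\theta_2, \\
\ddot\theta_2 * \ddot\theta_2
\end{array}
$$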

where  1  represents  the  constant  term  in  the  model. 
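
The count of 28 follows from one constant, six linear terms, and the 21 distinct pairwise products of six inputs. As a small sketch (the function name is hypothetical and the input values are made up), one row of the data matrix could be built like this:

```python
from itertools import combinations_with_replacement


def quadratic_terms(x):
    """Return the constant, linear, and pairwise-product terms of the inputs x."""
    terms = [1.0]                      # constant term
    terms.extend(x)                    # linear terms
    # all products x[i] * x[j] with i <= j (squares and cross terms)
    terms.extend(x[i] * x[j]
                 for i, j in combinations_with_replacement(range(len(x)), 2))
    return terms


# Six inputs: two positions, two velocities, two accelerations (made-up values).
row = quadratic_terms([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(len(row))
```

In general a quadratic model in d inputs has (d + 1)(d + 2)/2 terms, which is why the regression cost grows quickly with the input dimension.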

Let us assume we have 1000 samples of the two joint arm dynamics function. Forming a local
model of the shoulder torque involves finding estimates of the 28 terms or parameters of the local
quadratic model. The equation to be solved is

$$ X\beta = y \qquad (1) $$

where X is a 1000 by 28 data matrix, in which each row has the 28 terms of the quadratic model
corresponding to a point (sample of the function), and each column corresponds to a particular term
in the quadratic model. $\beta$ is the vector of 28 estimated parameters of the quadratic model, and
y is the vector of 1000 shoulder torques from the 1000 points included in the regression.

An  unweighted  regression  finds  the  solution  to  the  normal  equations: 

$$ (X^T X)\beta = X^T y \qquad (2) $$

The estimated parameters are used, with the query point, to predict the shoulder torque for the
query point. Another set of parameters is estimated for the elbow torque.

However, we assume the global quadratic model is not the correct model structure for predicting
the torques. These structural modeling errors imply that different sets of parameters are estimated
by the regression, given different data sets. The data set can be tailored to the query point by
emphasizing nearby points in the regression. The origin of the input data is first shifted by subtracting
the query point from each data point. Then each data point is given a weight.

Unweighted regression gives distant points equal influence with nearby points on the ultimate
answer to the query, for equally spaced data. To weight similar points more, locally weighted
regression is used. First, a distance is calculated from each of the stored data points (rows in
the X matrix) to the query point q:

$$ d_i^2 = \sum_j m_j^2 (x_{ij} - q_j)^2 \qquad (3) $$

where $x_{ij}$ is the j-th input component of the i-th stored point.

For the robot arm dynamics example, $d_i^2$ is calculated for each point in the following way:

$$
\begin{aligned}
d_i^2 ={} & m_1^2(\theta_1 - \theta_1^*)^2 + m_2^2(\theta_2 - \theta_2^*)^2 + m_3^2(\dot\theta_1 - \dot\theta_1^*)^2 \\
 & + m_4^2(\dot\theta_2 - \dot\theta_2^*)^2 + m_5^2(\ddot\theta_1 - \ddot\theta_1^*)^2 + m_6^2(\ddot\theta_2 - \ddot\theta_2^*)^2 \qquad (4)
\end{aligned}
$$

The superscript * indicates the query point, and the $m_j$ are the components of the distance metric.
The weight for each stored data point is a function of that distance:

$$ w_i = f(d_i^2) \qquad (5) $$

Each row i of X and y is multiplied by the corresponding weight $w_i$. A simple weighting function
just raises the distance to a negative power. The magnitude of the power determines how local the
regression will be (the rate of dropoff of the weights with distance).

$$ w_i = \frac{1}{d_i^p} \qquad (6) $$

This type of weighting function goes to infinity as the query point approaches a stored data point.
This forces the locally weighted regression to exactly match that stored point. If the data is noisy,
exact interpolation is not desirable, and a weighting scheme with limited magnitude is desired. One
such scheme, which we use in implementations on actual robots, is a Gaussian kernel:

$$ w_i = \exp\left(-\frac{d_i^2}{2k^2}\right) \qquad (7) $$

The parameter k scales the size of the kernel to determine how local the regression will be.
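
The distance and weighting steps can be sketched in a few lines. This is an illustration under stated assumptions, not the robot code: the squared distance scales each input difference by a metric component $m_j$, the inverse-power weight is $1/d^p$, and the Gaussian weight is assumed to have the form $\exp(-d^2/(2k^2))$. All function names are hypothetical.

```python
import math


def squared_distance(point, query, metric):
    """Squared distance with per-dimension scale factors m_j."""
    return sum((m * (x - q)) ** 2 for x, q, m in zip(point, query, metric))


def inverse_power_weight(d2, p=2.0):
    """Weight 1 / d^p computed from the squared distance; unbounded as d -> 0."""
    return float("inf") if d2 == 0.0 else d2 ** (-p / 2.0)


def gaussian_weight(d2, k=1.0):
    """Weight exp(-d^2 / (2 k^2)); never larger than 1 (assumed kernel form)."""
    return math.exp(-d2 / (2.0 * k * k))


d2 = squared_distance([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
print(d2, inverse_power_weight(d2), gaussian_weight(d2))
```

The inverse-power weight forces exact interpolation because it diverges at zero distance, whereas the Gaussian weight stays bounded, which is the behavior wanted for noisy data.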

A potential problem is that the data points may be distributed in such a way as to make the
regression matrix X nearly singular. Ridge regression (Draper and Smith 1981) is used to prevent
this problem. The equation solved for $\beta$, with X and y already weighted, is

$$ (X^T X + \Lambda)\beta = X^T y \qquad (8) $$

where $\Lambda$ is a diagonal matrix with small positive diagonal elements $\lambda_i^2$. This is
equivalent to adding extra rows to X, one for each column, each having a single non-zero element,
$\lambda_i$, in the ith column (with zeros in the corresponding entries of y). Adding additional
rows can be viewed as adding “fake” data, which, in the absence of sufficient real data, biases the
parameter estimates to zero (Draper and Smith 1981). Another view of ridge regression parameters
is that they are the Bayesian assumptions about the a priori distributions of the estimated parameters.
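
Putting the pieces together, a single lookup might be sketched as follows. This is a minimal illustration with a linear local model, not the implementation used on the robot. It assumes a Gaussian weight of the form $\exp(-d^2/(2k^2))$ with an identity distance metric, a single ridge value for every diagonal element, and a plain Gaussian elimination solver; the names `solve` and `lwr_predict` are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lwr_predict(xs, ys, query, k=0.5, ridge=1e-6):
    """Fit a weighted linear model around `query` and return its prediction."""
    cols = len(query) + 1
    xtx = [[0.0] * cols for _ in range(cols)]
    xty = [0.0] * cols
    for x, y in zip(xs, ys):
        shifted = [xi - qi for xi, qi in zip(x, query)]  # origin moved to query
        d2 = sum(s * s for s in shifted)
        w = math.exp(-d2 / (2.0 * k * k))                # Gaussian weight
        row = [w] + [w * s for s in shifted]             # weighted row of X
        wy = w * y                                       # weighted output
        for a in range(cols):
            xty[a] += row[a] * wy
            for b in range(cols):
                xtx[a][b] += row[a] * row[b]
    for a in range(cols):
        xtx[a][a] += ridge                               # ridge term on the diagonal
    beta = solve(xtx, xty)
    return beta[0]   # at the shifted origin only the constant term contributes


xs = [[0.0], [0.5], [1.0], [1.5], [2.0]]   # made-up inputs
ys = [1.0, 2.0, 3.0, 4.0, 5.0]             # outputs of y = 2x + 1
print(lwr_predict(xs, ys, [1.0]))
```

Because the origin is shifted to the query point, the prediction is just the constant coefficient, and the remaining coefficients form the local linearization at the query.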


2.2  Assessing  the  computational  cost 

Lookup has three stages: forming weights, forming the regression matrix, and solving the normal
equations. Let us examine how the cost of each of these stages grows with the size of the data set
and dimensionality of the problem. We will assume a linear local model.

Forming and applying the weights involves scanning the entire data set, so it scales linearly
with the number of data points in the database (n). For each of d input dimensions there is a
constant number of operations, so the number of operations scales linearly with the number of input
dimensions. Note that we can eliminate points whose distance is above a threshold, reducing the
number of points considered in subsequent stages of the computation.

Each element of $X^T X$ and $X^T y$ is the inner (dot) product of two columns of X or y. The
architecture of digital signal processors is ideally suited for this computation, which consists of
repeated multiplies and accumulates. The computation is linear in the number of rows n and quadratic
in the number of columns ($d^2 + d * o$), where d is the number of input dimensions and o is the
number of output dimensions.

The normal equations are solved using a matrix decomposition, which is cubic in the number
of input dimensions, and independent of the number of data points. Other more sophisticated and
more expensive decompositions, such as the singular value decomposition, do not need to be used
since the ridge regression procedure guarantees that the normal equations will be well-conditioned.

The most straightforward parallel implementation of locally weighted regression would distribute
the data points among several processors. Queries can be broadcast to the processors, and each
processor can weight its data set and form its contribution to $X^T X$ and $X^T y$. These contributions
can be summed and the full normal equations solved on a single processor. The communication costs
are linear in the number of processors and quadratic in the number of columns ($d^2 + d * o$) and
independent of the total number of points.
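
The reason this partitioning works is that $X^T X$ and $X^T y$ are sums over rows, so partial sums over disjoint subsets of the data add up to the full matrices. A minimal sketch, assuming already weighted rows and a single output dimension (the function names are hypothetical and the data are made up):

```python
def partial_normal_equations(rows, targets):
    """One processor's contribution to X^T X and X^T y for its share of the rows."""
    cols = len(rows[0])
    xtx = [[0.0] * cols for _ in range(cols)]
    xty = [0.0] * cols
    for row, y in zip(rows, targets):
        for a in range(cols):
            xty[a] += row[a] * y
            for b in range(cols):
                xtx[a][b] += row[a] * row[b]
    return xtx, xty


def combine(parts):
    """Sum the per-processor contributions into the full normal equations."""
    cols = len(parts[0][1])
    xtx = [[sum(p[0][a][b] for p in parts) for b in range(cols)]
           for a in range(cols)]
    xty = [sum(p[1][a] for p in parts) for a in range(cols)]
    return xtx, xty


rows = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]]   # made-up weighted rows
targets = [1.0, 2.0, 3.0, 4.0]                            # made-up weighted outputs
full = partial_normal_equations(rows, targets)
split = combine([partial_normal_equations(rows[:2], targets[:2]),
                 partial_normal_equations(rows[2:], targets[2:])])
print(full == split)
```

Each processor only has to send its small matrices, whose size depends on the number of columns and not on how many points it holds, which matches the communication cost stated above.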

We have implemented the locally weighted regression procedure on a 33MHz Intel i860
microprocessor. The peak computation rate of this processor is 66 MFlops. We have achieved effective
computation rates of 20 MFlops on a learning problem with 10 input dimensions and 5 output
dimensions, using a linear local model. This leads to a lookup time of approximately 20 milliseconds
on a database of 1000 points.

This memory-based approach can also be simulated using k-d tree data structures (Friedman,
Bentley, and Finkel 1977) on a standard serial computer and using parallel search on a massively
parallel computer, the Connection Machine (Hillis 1985).


2.3  Implementing  locally  weighted  regression  in  a  neural  network 

The memory network of Steinbuch and Taylor can be used to find the nearest stored vectors to the
current input vector. The memory network computes a measure of the distance between each stored
vector and the input vector in parallel, and then a “winner take all” network selects the nearest
vector (nearest neighbor). Euclidean distance has been chosen as the distance metric, because the
Euclidean distance is invariant under rotation of the coordinates used to represent the input vector.

The memory network consists of three layers of units: input units, hidden or memory units,
and output units. The input units are fully connected to the hidden units. The squared Euclidean
distance between the input vector (i) and a weight vector ($w_k$) for the connections of the input
units to hidden unit k is given by:

$$ d_k^2 = (i - w_k)^T (i - w_k) = i^T i - 2\, i^T w_k + w_k^T w_k $$

Since the quantity $i^T i$ is the same for all the hidden units, minimizing the distance between the
input vector and the weight vector for each hidden unit is equivalent to maximizing:

$$ i^T w_k - \tfrac{1}{2}\, w_k^T w_k $$

This quantity is the inner product of the input vector and the weight vector for hidden unit k, biased
by half the squared length of the weight vector. Maximizing this quantity using a winner take all
circuit allows the unit with the smallest distance to be selected.
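
The equivalence between the smallest distance and the largest biased inner product is easy to check numerically. The sketch below is illustrative only: it stands in for the hidden units with a plain list of stored vectors and replaces the winner take all circuit with `max`; the function names are hypothetical.

```python
def nearest_by_distance(stored, query):
    """Index of the stored vector with the smallest squared Euclidean distance."""
    return min(range(len(stored)),
               key=lambda k: sum((q - w) ** 2 for q, w in zip(query, stored[k])))


def nearest_by_activation(stored, query):
    """Index of the 'hidden unit' with the largest biased inner product."""
    def activation(w):
        bias = -0.5 * sum(wi * wi for wi in w)        # half the squared length
        return sum(q * wi for q, wi in zip(query, w)) + bias
    return max(range(len(stored)), key=lambda k: activation(stored[k]))


stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [10.0, 0.0]]   # made-up memory
query = [2.0, 2.0]
print(nearest_by_distance(stored, query), nearest_by_activation(stored, query))
```

Both functions select the same unit because the two criteria differ only by the query's own squared length, which is common to every unit.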

Dynamics of the memory network neurons allow the memory network to output a sequence of
nearest neighbors. These nearest neighbors form the selected training sequence for the local model
network. Memory unit dynamics can also be used to allocate “free” memory units to new experiences,
and to forget old training points when the capacity of the memory network is fully utilized.

The local model network consists of only one layer of modifiable weights preceded by any number
of layers with fixed connections. There may be arbitrary preprocessing of the inputs of the local
model, but the local model is linear in the parameters used to form the fit. The local model network
using the LMS training algorithm performs a linear regression of the transformed inputs against the
desired outputs. Thus, the local model network can be used to fit a linear regression model to the
selected training set. With multiplicative interactions between inputs the local model network can
be used to fit a polynomial surface (such as a quadratic) to the selected training set. An alternative
implementation of the local model network could use a single layer of “sigma-pi” units (Durbin and
Rumelhart 1989).

This network design has simple training rules. In the memory network the weights are simply
the values of the components of input and output vectors, and the bias for each memory unit is just
half the squared length of the corresponding input weight vector. No search for weights is necessary,
since the weights are directly given by the data to be stored. The local model network is linear in
the weights, leading to a single optimum which can be found by linear regression or gradient descent.
Thus, convergence to the global optimum is guaranteed when forming a local model to answer a
particular query.


3  Related  Work 

Memory-based representations have a long history. Approaches which represent previous experiences
directly and use a similar experience or similar experiences to form a local model are often referred
to as nearest neighbor or k-nearest neighbor approaches. Local models (often polynomials) have
been used for many years to smooth time series (Sheppard 1912, Sherriff 1920, Whittaker and
Robinson 1924, Macauley 1931) and interpolate and extrapolate from limited data. Barnhill (1977)
and Sabin (1980) survey the use of nearest neighbor interpolators to fit surfaces to arbitrarily spaced
points. Eubank (1988) surveys the use of nearest neighbor estimators in nonparametric regression.
Lancaster and Salkauskas (1986) refer to nearest neighbor approaches as “moving least squares” and
survey their use in fitting surfaces to data. Farmer and Sidorowich (1988a, 1988b) survey the use of
nearest neighbor and local model approaches in modeling chaotic dynamic systems. Kawamura and
Nakagawa (1990) and Kawamura, Noborio, and Nakagawa (1990) describe approaches to memory-based
control of robots. Specht (1991) describes a memory-based neural network approach based on
a probabilistic model that motivates using weighted averaging as the local model.

An early use of direct storage of experience was in pattern recognition. Fix and Hodges (1951,
1952) suggested that a new pattern could be classified by searching for similar patterns among a set of
stored patterns, and using the categories of the similar patterns to classify the new pattern. Steinbuch
and Taylor proposed a neural network implementation of the direct storage of experience and
nearest-neighbor search process for pattern recognition (Steinbuch 1961, Taylor 1959), and pointed out that
this approach could be used for control (Steinbuch and Piske 1963). Stanfill and Waltz (1986)
proposed using directly stored experience to learn pronunciation, using a Connection Machine and
parallel search to find relevant experience. They have also applied their approach to medical diagnosis
(Waltz 1987) and protein structure prediction.

Nearest neighbor approaches have also been used in nonparametric regression and fitting surfaces
to data. Often, a group of similar experiences, or nearest neighbors, is used to form a local model,
and then that model is used to predict the desired value for a new point. Local models are formed
for each new access to the memory. Watson (1964), Royall (1966), Crain and Bhattacharyya (1967),
Cover (1968) and Shepard (1968) proposed using a weighted average of a set of nearest neighbors.
Gordon and Wixom (1978) and Barnhill, Dube, and Little (1983) analyze such weighted average
schemes. Crain and Bhattacharyya (1967), Falconer (1971), and McLain (1974) suggested using a
weighted regression to fit a local polynomial model at each point a function evaluation was desired.
All of the available data points were used. Each data point was weighted by a function of its
distance to the desired point in the regression. McIntyre, Pollard, and Smith (1968), Pelto, Elkins
(1970), Stone (1975), and Franke and Nielson (1980) suggested fitting a polynomial surface to a
set of nearest neighbors, also using distance weighted regression. Stone scaled the values in each
dimension when the experiences were stored. The standard deviations of each dimension of previous
experiences were used as the scaling factors, so that the range of values in each dimension was
approximately equal. This affects the distance metric used to measure closeness of points. Cleveland
(1979) proposed using robust regression procedures to eliminate outlying or erroneous points in
the regression process. A program implementing a refined version of this approach (LOESS) is
available by sending electronic mail containing the single line, send dloess from a, to the address
netlib@research.att.com (Grosse 1989). Cleveland, Devlin and Grosse (1988) analyze the statistical
properties of this approach, and Cleveland and Devlin (1988) show examples of its use. Stone
(1977, 1982), Devroye (1981), Lancaster (1979), Lancaster and Salkauskas (1981), Cheng (1984), Li
(1984), Farwig (1987), and Muller (1987) provide analyses of nearest neighbor approaches. The
performance of nearest neighbor approaches has also been compared with other methods for fitting
surfaces to data.

4  Learning  in  simulation 

The network has been used for motor learning of a simulated arm and a simulated running machine.
The network performed surprisingly well in these simple evaluations. The simulated arm was able
to follow a desired trajectory after only a few practice movements. Performance of the simulated
running machine in following a series of desired velocities was also improved. This paper will report
only on the arm trajectory learning.

Figure 2 shows the simulated 2-joint planar arm. The problem faced in this simulation is to
learn the correct joint torques to drive the arm along the desired trajectory (the inverse dynamics
problem). In addition to the feedforward control signal produced by the network described in this
paper, a feedback controller was also used.

Figure 3 shows several learning curves for this problem. The first point in each of the curves shows
the performance generated by the feedback controller alone. The error measure is the RMS torque
error during the movement. The highest curve shows the performance of a nearest neighbor method
without a local model. On each time step the nearest point was used to generate the torques for
the feedforward command, which were then summed with the output from the feedback controller.
The second curve shows the performance using a linear local model. The third curve shows the
performance using a quadratic local model. Adding the local model network greatly speeds up
learning. The network with the quadratic local model learned more quickly than the one with the
local linear model.

Figure 4: Performance of various methods on two joint arm dynamics.

5  Interference 

To illustrate the differences between some proposed neural network representations and a
memory-based representation, two neural network methods, CMAC (Albus 1975ab) and sigmoidal feedforward
neural networks, were compared to the approach explored in this paper. The parameters for the
CMAC approach were taken from Miller, Glanz, and Kraft (1987) who used the CMAC to model
arm inverse dynamics. The architecture for the sigmoidal feedforward neural network was taken from
Goldberg and Pearlmutter (1988, section 6) who also modeled arm inverse dynamics.

The ability of each of these methods to predict the torques of the simulated two joint arm at
1000 random points was compared. Figure 4 plots the normalized RMS prediction error. The points
were sampled uniformly using ranges comparable to those used in (Miller et al. 1987). Initially
each method was trained on a training set of 1000 random samples of the two joint arm dynamics
function, and then the predictions of the torques on a separate test set of 1000 random samples of
the two joint arm dynamics function were assessed (points 1, 3, and 5). Each method was then
trained on 10 attempts to make a particular desired movement. Each method successfully learned
the desired movement. After this second round of training, performance on the random test set was
again measured (points 2, 4, and 6).

The data indicate that the locally weighted regression approach (filled in circles) and the sigmoidal
feedforward network approach (asterisks) both generalize well on this problem (points 3 and 5 have
low error). The CMAC (diamonds) did not generalize well on this problem (point 1 has a large
error), although it represented the original training set with a normalized RMS error of 0.000001. A
variety of CMAC resolutions were explored, ranging from a basic CMAC cell size covering the entire
range of data to a cell size covering a fifth of the data range in each dimension. A cell size covering
one half the data ranges in each dimension generalized best (the data shown here).

After training on a different training set (the attempts to make a particular desired movement),
the sigmoidal feedforward neural network lost its memory of the full dynamics (point 4), and
represented only the dynamics of the particular movements being learned in the second training set. This
interference between new and previously learned data was not prevented by increasing the number
of hidden units in the single layer network from 10 up to 100. The other methods explored did not
show this interference effect (points 2 and 6).


6 Tuning Architectural Parameters Globally

For the example problem of two joint arm inverse dynamics, we have introduced 34 free parameters
into the local regression process: the weighting function dropoff parameter $p$, the 6 elements of the
distance metric $m_i$, and the 27 variable diagonal elements of $\Lambda$ (the ridge regression parameters $\lambda_i$).
The element of $\Lambda$ corresponding to the constant term, $\lambda_1$, is held fixed.

A cross validation approach is used to choose values for these fit parameters. For each point a
query is done to estimate the output at that point, after removing the point from the database. The
difference between the estimate and the actual value for that point is the cross validation error for
that point. The sum of the squared cross validation errors is minimized using a nonlinear parameter
estimation procedure (MINPACK (More, Garbow, and Hillstrom 1980) or NL2SOL (Dennis, Gay,
and Welsch 1981), for example). Because the local model is linear in the unknown parameters we
can analytically take the derivative of the cross validation error with respect to the parameters to be
estimated, which greatly speeds up the search process. In the memory-based approach computing
the cross validation error for a single point is no more computationally expensive than answering
a query. This is quite different from a parametric neural network, where a new network must be
trained for each cross validation training set with a particular point removed.
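
The leave-one-out procedure described above can be sketched as follows. As simplifying assumptions, the sketch tunes only a single Gaussian kernel width `k` and replaces the nonlinear optimizer (MINPACK or NL2SOL in the text) with a plain search over a hypothetical list of candidate values; all function names are illustrative.

```python
import numpy as np

def lwr_fit_predict(X, y, xq, k):
    """Answer one query with a locally weighted linear model (Gaussian kernel assumed)."""
    w = np.exp(-np.sum((X - xq) ** 2, axis=1) / (2.0 * k ** 2))
    Xa = np.hstack([X, np.ones((len(X), 1))])   # inputs plus a constant term
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * Xa, sw * y, rcond=None)
    return np.append(xq, 1.0) @ beta

def loo_cv_error(X, y, k):
    """Sum of squared cross validation errors: query each point with itself removed."""
    total = 0.0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i           # drop point i from the database
        total += (y[i] - lwr_fit_predict(X[keep], y[keep], X[i], k)) ** 2
    return total

def choose_kernel_width(X, y, candidates):
    """Pick the candidate width with the smallest cross validation error."""
    errors = [loo_cv_error(X, y, k) for k in candidates]
    return candidates[int(np.argmin(errors))]
```

Each cross validation error costs one query, which is the point made in the text: no model has to be retrained when a point is left out.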

The cross validation to optimize the fit parameters may be done globally, using all the experiences
in the memory to produce one set of fit parameters. Different fit parameters can be used for different
outputs. The cross validation may also be done locally, either with each query, or separately for
different regions of the input space, producing different sets of fit parameters specialized for particular
queries, as discussed in the next section.

We can use the optimized distance metric to find which input variables are irrelevant to the
function being represented. In the horizontal two-joint arm inverse dynamics problem, $m_1$, the
weight on the distances in the $\theta_1$ direction, typically drops to zero, indicating that the input variable
is irrelevant to predicting $\tau_1$ and $\tau_2$. This is actually the case, as $\theta_1$ does not appear in the true
dynamics equations for an arm operating in a horizontal plane.

We can also interpret the ridge regression parameters $\lambda_i$. Since the arm dynamics are linear in
acceleration, the terms in the local model that are quadratic in acceleration ($\ddot{\theta}_1^2$, $\ddot{\theta}_1 \ddot{\theta}_2$, $\ddot{\theta}_2^2$) are not
relevant to predicting torques. Similarly the products of velocity and acceleration ($\dot{\theta}_1 \ddot{\theta}_1$, $\dot{\theta}_1 \ddot{\theta}_2$,
$\dot{\theta}_2 \ddot{\theta}_1$, $\dot{\theta}_2 \ddot{\theta}_2$) are also not relevant to the dynamics. The ridge regression parameter for each of these
terms becomes very large in the parameter optimization. The effect of this is to force the estimated
local model parameter for these terms to be zero and the terms to have no effect on the regression.

We have also explored stepwise regression procedures to determine which terms of the local model
are useful (Draper and Smith 1981) with similar results.
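
The effect of a large ridge regression parameter on a single term can be illustrated with a short sketch. The closed form used here, $\beta = (X^T W X + \Lambda)^{-1} X^T W y$ with a diagonal $\Lambda$, is the usual weighted ridge estimate and is assumed to match the local regression in the text; the function name and arguments are hypothetical.

```python
import numpy as np

def ridge_local_fit(X, y, w, lam):
    """Weighted ridge regression: beta = (X^T W X + Lambda)^-1 X^T W y.

    X   : (n, p) matrix of local model terms (constant, linear, quadratic, ...)
    w   : (n,) weights of the stored points for the current query
    lam : (p,) diagonal of Lambda, one ridge parameter per term
    """
    XtW = X.T * w                      # scales column i of X^T by weight w[i]
    return np.linalg.solve(XtW @ X + np.diag(lam), XtW @ y)
```

With a very large entry in `lam`, the corresponding coefficient is driven towards zero while the remaining terms are still fitted, which is the behavior described for the acceleration terms above.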

7  Tuning  Architectural  Parameters  Locally 

In the process of implementing various robot learning algorithms it has become clear to us that
the architectural parameters should depend on the location of the query point. In this section we
describe new procedures that estimate local values of the fit parameters optimized for the site of the
current query point. We want to demonstrate the differences between local and global fitting in an
example where we only focus on the kernel width $k$ of a Gaussian weighting function. In Figure 5a
a noisy data set of the function $y = x - \sin^{[\ldots]}(2\pi x^3)\cos(2\pi x^2)\exp(x^{[\ldots]})$ was fitted by locally weighted
regression with a globally optimized constant $k$. In the left half of the plot, the regression starts to fit
noise because $k$ had to be rather small to fit the high frequency regions on the right half of the plot.
The prediction intervals, which will be introduced below, demonstrate high uncertainty in several
places. To avoid such undesirable behavior, a local optimization criterion is needed. Standard linear
regression analysis provides a series of well-defined statistical tools to assess the quality of fits, such
as coefficients of determination, t-tests, F-tests, the PRESS statistic, Mallows' Cp-test, confidence
intervals, prediction intervals, and many more (e.g. Myers 1990). These tools can be adapted to
locally weighted regression. We do not want to discuss all possible available statistics here but rather
focus on two that have proved to be quite helpful.

Cross validation has a relative in linear regression analysis called the PRESS residual error. The
PRESS statistic performs leave-one-out cross validation, however, without the need to recalculate the
regression parameters for every excluded point. This is computationally very efficient. The PRESS
residual can be expressed as a mean squared cross validation error $MSE_{cross}$. In Figure 5b the
same data as in Figure 5a was fitted by adjusting $k$ to minimize $MSE_{cross}$ at each query point. The
outcome is much smoother than that of global cross validation, and also the prediction intervals are
narrower. It should be noted that the extrapolation properties on both sides of the graph are quite
appropriate (compared to the known underlying function), in comparison to Figure 5a and Figure 5c.
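
The shortcut behind the PRESS statistic can be sketched for a weighted linear fit. The leave-one-out residual of point $i$ equals its ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is the leverage of that point in the weighted problem. The weighted average used for `local_mse_cross` is an assumption of this sketch, since the exact definition of $MSE_{cross}$ is not spelled out in the text, and both function names are hypothetical.

```python
import numpy as np

def press_residuals(X, y, w):
    """Leave-one-out residuals of a weighted linear fit, computed without refitting."""
    sw = np.sqrt(w)
    Z = sw[:, None] * X                       # weighted regressors
    G = np.linalg.inv(Z.T @ Z)                # (X^T W X)^-1
    beta = G @ Z.T @ (sw * y)                 # weighted least squares solution
    h = np.einsum("ij,jk,ik->i", Z, G, Z)     # leverages h_ii = w_i x_i^T G x_i
    return (y - X @ beta) / (1.0 - h)

def local_mse_cross(X, y, w):
    """Weighted mean of the squared PRESS residuals (assumed form of MSE_cross)."""
    e = press_residuals(X, y, w)
    return np.sum(w * e ** 2) / np.sum(w)
```

To tune the kernel width at a query point, one would recompute the weights `w` for each candidate width and keep the width with the smallest `local_mse_cross`.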

Prediction intervals $I_i$ give bounds on the prediction error at a query point $x_i$ (Myers
1990). Besides using the intervals to assess the confidence in the fit at a certain point, they provide
another optimization measure. Figure 5c demonstrates the result when applying this statistic for
optimization of $k$ at each query point. Again, the fitted curve is significantly smoother than the
global cross validation fit. A rather interesting and also typical effect happens at the right side of
the plot: when starting to extrapolate, the prediction intervals suddenly favor a global regression
instead of the local regression, i.e., the $k$ was chosen to be rather large. It turns out that in local
optimization one always finds a competition between local and global regression. But sudden jumps
typically take place only when the prediction intervals are so large that
the data is not reliable anyway.

8 Assessing The Quality of the Local Model

Both the local cross validation error $MSE_{cross}$ and the prediction interval $I_i$ may serve to assess the
quality of the local fit:

$$Q_{fit} = \frac{\sqrt{MSE_{cross}}}{c} \qquad (9)$$

or

$$Q_{fit} = \frac{I_i^{+} - I_i^{-}}{c} \qquad (10)$$

The factor $c$ makes $Q_{fit}$ dimensionless and normalizes it with respect to some user defined quantity.

In our applications, we usually preferred $Q_{fit}$ based on the prediction intervals, which is the more
conservative assessment.
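
For orientation, the prediction interval at a query point $x_q$ can be written in the standard form known from linear regression. Carrying this form over to the locally weighted case, with a weight matrix $W$, an effective number of data points $n'$, and an effective number of parameters $p'$, is an assumption of this sketch and not a formula quoted from the text.

```latex
\hat{y}_q = x_q^T \beta, \qquad
I_q^{\pm} = \hat{y}_q \pm t_{\alpha/2,\; n'-p'} \; s \, \sqrt{1 + x_q^T \left( X^T W X \right)^{-1} x_q},
\qquad
s^2 = \frac{\sum_i w_i \left( y_i - x_i^T \beta \right)^2}{n' - p'}
```

For $W = I$, $n' = n$, and $p' = p$ this reduces to the ordinary least squares prediction interval. The width $I_q^{+} - I_q^{-}$ grows both with the residual variance $s^2$ and with the distance of the query from the bulk of the weighted data.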


9  Dealing  with  Outliers 

Locally weighted regression is sensitive to outliers,
although the influence of outliers will not be noticed unless the outliers lie close enough to a query
point. In Figure 6a we added three outliers to the test data of Figure 5 to demonstrate this effect; the
plots in Figure 6 should be compared to Figure 5c. Moore and Atkeson (1993) applied the median
absolute deviation procedure from robust statistics (Hampel et al., 1985) to globally remove outliers
in LWR. Again, we would like to localize our criterion for outlier removal. The PRESS statistic can
be adapted for this purpose by using the standardized individual
PRESS residual. This measure has zero mean and unit variance. If, for a given data point, it deviates
from zero more than a certain threshold, the point can be called an outlier. A conservative threshold
would be 1.96, discarding all points lying outside the 95% area of the normal distribution. In our
applications, we used 2.57, cutting off all data outside the 99% area of the normal distribution. As
can be seen in Figure 6b, the effects of the outliers are reduced.
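
A local outlier test of this kind can be sketched as follows. In ordinary linear regression the standardized PRESS residual of point $i$ equals $e_i / (s \sqrt{1 - h_{ii}})$; applying this to the weighted problem, and estimating $s^2$ with the sum of the weights as the effective number of points, are assumptions of this sketch. The function names are hypothetical; the threshold 2.57 is the value quoted above.

```python
import numpy as np

def standardized_press_residuals(X, y, w):
    """Standardized PRESS residuals of a weighted linear fit around one query."""
    sw = np.sqrt(w)
    Z, v = sw[:, None] * X, sw * y
    G = np.linalg.inv(Z.T @ Z)
    r = v - Z @ (G @ Z.T @ v)                        # weighted residuals
    h = np.einsum("ij,jk,ik->i", Z, G, Z)            # leverages h_ii
    s2 = np.sum(r ** 2) / (np.sum(w) - X.shape[1])   # assumed effective degrees of freedom
    return r / np.sqrt(s2 * (1.0 - h))

def local_outliers(X, y, w, threshold=2.57):
    """Flag points whose standardized PRESS residual exceeds the threshold."""
    return np.abs(standardized_press_residuals(X, y, w)) > threshold
```

Because the weights depend on the query, a point is only flagged when it lies close enough to the query to matter, which matches the localized criterion the text asks for.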

10  A  Testbed  for  Learning  Algorithms:  Robot  Juggling 

We have constructed a system for experiments in real-time motor learning (Van Zyl 1991). The task
is a juggling task known as “devil sticking”. A center stick is batted back and forth between two
handsticks (Figure 7A). Figure 7B shows a sketch of our devil sticking robot. The juggling robot uses
its top two joints to perform planar devil sticking. Hand sticks are mounted on the robot with springs
and dampers. This implements a passive catch. The center stick does not bounce when it hits the
hand stick, and therefore requires an active throwing motion by the robot. To simplify the problem
the center stick is constrained by a boom to move on the surface of a sphere. For small movements
the center stick movements are approximately planar. The boom also provides a way to measure
the current state of the center stick. Ultimately we want to be able to perform unconstrained three
dimensional devil sticking using vision to provide sensing of the center stick state.

The task state is the predicted location of where the center stick would hit the hand stick if the
hand stick was held in a nominal position. Standard ballistics equations for the flight of the center
stick are used to map flight trajectory measurements $(x(t), y(t), \theta(t))$ into a task state:

$$\mathbf{x} = (p, \theta, \dot{x}, \dot{y}, \dot{\theta}) \qquad (11)$$

$p$ is the distance from the middle of the center stick at which the hand stick at the nominal position
contacts the center stick, $\theta$ is the angle of the center stick at nominal contact, and $\dot{x}$, $\dot{y}$, and $\dot{\theta}$ are
the velocities and angular velocity of the center stick at nominal contact.

The task command is given by a displacement of the hand stick from the nominal position $(x_h, y_h)$,
[...] velocity vector $(v_x, v_y)$:

$$\mathbf{u} = (x_h, y_h, \theta_t, v_x, v_y) \qquad (12)$$

Every time the robot catches and throws the devil stick it generates an experience of the form
$(\mathbf{x}_k, \mathbf{u}_k, \mathbf{x}_{k+1})$ where $\mathbf{x}_k$ is the current state, $\mathbf{u}_k$ is the action performed by the robot, and $\mathbf{x}_{k+1}$ is
the state of the center stick after the hit. Thus, a forward or an inverse model would have 10 input
dimensions and 5 output dimensions.

Initially we explored learning an inverse model of the task, using nonlinear “deadbeat” control
to attempt to eliminate all error on each hit. Each hand had its own inverse model of the form:

$$\mathbf{u}_k = f^{-1}(\mathbf{x}_k, \mathbf{x}_{k+1}) \qquad (13)$$

Before each hit the system looked up a command with the predicted nominal impact state and the
desired result state:

$$\mathbf{u}_k = f^{-1}(\mathbf{x}_k, \mathbf{x}_d) \qquad (14)$$

Inverse model learning was successfully used to train the system to perform the devil sticking
[...] achieved. The system incorporated new data in real time, and
used databases of several hundred hits. Lookups took less than 20 milliseconds, and therefore several
lookups could be performed before the end of the flight of the center stick. Later queries incorporated
more measurements of the flight of the center stick and therefore more accurate predictions of the
state of the task. However, the system required substantial structure in the initial training to achieve
this performance. The system was started with a default command that was appropriate for open
loop performance of the task. Each control parameter was varied systematically to explore the
space near the default command. A global linear model was made of this initial data, and a linear
controller based on this model was used to generate an initial training set for the memory-based
system (approximately 100 hits). Learning with small amounts of initial data was not possible.

We also experimented with learning based on both inverse and forward models. After a command
is generated by the inverse model, it can be evaluated using a memory-based forward model with
the same data:

$$\hat{\mathbf{x}}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k) \qquad (15)$$

Because it produces a local linear model, the locally weighted regression procedure will produce
estimates of the derivatives of the forward model with respect to the commands as part of the
regression. These derivatives can be used to find a correction to the command
vector that reduces errors in the predicted outcome based on the forward model:

$$\frac{\partial f}{\partial \mathbf{u}} \Delta \mathbf{u}_k = \hat{\mathbf{x}}_{k+1} - \mathbf{x}_d \qquad (16)$$

The pseudo-inverse of the matrix $\partial f / \partial \mathbf{u}$ is used to solve the above equation for $\Delta \mathbf{u}_k$, to handle
situations in which the matrix is singular or there are a different number of commands and states
(which does not apply for devil sticking). This process of command refinement can be repeated until
the forward model no longer produces accurate predictions of the outcome. This will happen when
the query to the forward model requires significant extrapolation from the current database. The
distance to the nearest stored data point can be used as a crude measure of the validity of the forward
model.
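
The command refinement loop can be sketched as follows. The callables `forward_model` and `jacobian_u` are hypothetical placeholders for the memory-based forward model and its derivative with respect to the command; in the text both come from the locally weighted regression. Because the correction is defined from the difference between the predicted and the desired outcome, the sketch subtracts it from the command, which is an assumed sign convention.

```python
import numpy as np

def refine_command(forward_model, jacobian_u, x, u, x_desired, iterations=3):
    """Iteratively correct a command so the predicted outcome approaches x_desired."""
    for _ in range(iterations):
        predicted = forward_model(x, u)
        # Solve (df/du) du = predicted - desired with the pseudo-inverse.
        du = np.linalg.pinv(jacobian_u(x, u)) @ (predicted - x_desired)
        u = u - du   # subtract, because du is defined from (predicted - desired)
    return u
```

A practical version would stop early once the query leaves the region supported by data, for example when the distance to the nearest stored point exceeds a chosen bound.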

We investigated this method for incremental learning of devil sticking in simulations. The
outcome, however, did not meet expectations: without sufficient initial data around the setpoint the
algorithm did not work. We see two reasons for this. First, similar to the pure inverse model
approach, the inverse-forward model acts as a one-step deadbeat controller in that it tries to eliminate
error in one time step. One-step deadbeat control applies unreasonably large commands to correct
for deviations from the setpoint. The workspace bounds and command bounds of our devil sticking
robot limit the size of the commands. In addition, deadbeat control in the presence of errors in
the model seems to lead to large inappropriate commands. Second, the ten dimensional input space
is large, and even if experiences are uniformly randomly distributed in the space there is often not
enough data near a particular point to make a robust inverse or forward model.

Two requirements had to be added to the devil sticking controller. First, the controller should
not be deadbeat. It should plan to attain the goal using multiple control actions. Second, the control
must have as the primary goal increasing the data density in the current region of the state-action
space, and as a secondary goal to arrive at the desired goal state. Both requirements are fulfilled by
a simple exploration algorithm we have developed, the shifting setpoint algorithm (SSA). Applied to
devil sticking, the SSA proceeds as follows:

1. Regardless of the poor juggling quality of the robot (i.e., at most two or three hits per trial), the
SSA makes the robot repeat these initial actions with small random perturbations until a cloud
of data has been collected somewhere in the state-action space of each hand. An abstract illustration
of this is given in Figure 8.

2. Each point in the data cloud of each hand is used as a candidate for a setpoint of the
corresponding hand by trying to predict its output from its input with locally weighted regression.
The point achieving the narrowest local confidence interval becomes the setpoint of the hand,
and a linear quadratic (LQ) controller is calculated from its local linear model (Anderson and
Moore, 1990). By means of these controllers, the amount of data around the setpoints can
quickly be increased until the quality of the local models exceeds a chosen statistical threshold.

3. At this point, the setpoints are gradually shifted towards the goal setpoints until the data
support of the local models falls below a statistical value. After shifting, the smoothing kernel
is optimized by minimizing the local cross validation error.

4. The SSA continues by collecting data in the new regions of the workspace until the setpoints
can be shifted again (Fig. 8 bottom-left). The procedure terminates by reaching the goal,
leaving a (hyper-) ridge of data in space (Figure 8 bottom-right).
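
The LQ controller of step 2 can be sketched for a local linear model of the assumed form $\mathbf{x}_{k+1} - \mathbf{x}^* = A(\mathbf{x}_k - \mathbf{x}^*) + B(\mathbf{u}_k - \mathbf{u}^*)$ around a setpoint $(\mathbf{x}^*, \mathbf{u}^*)$. The gain is obtained here by iterating the discrete-time Riccati equation, which is one standard way to compute it; the cost matrices `Q` and `R`, the iteration count, and the function name are assumptions of this sketch.

```python
import numpy as np

def lq_gain(A, B, Q, R, iterations=500):
    """Discrete-time LQ feedback gain K (u - u* = -K (x - x*)) via Riccati iteration."""
    P = Q.copy()
    for _ in range(iterations):
        # Gain for the current cost-to-go matrix P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A^T P A - A^T P B K.
        P = Q + A.T @ P @ (A - B @ K)
    return K
```

In the setting of the text, `A` and `B` would be read off the local linear model fitted by locally weighted regression at the current setpoint, and the gain would be recomputed whenever the setpoint is shifted.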

The linear quadratic controllers play a crucial role for devil sticking. It is difficult to build good
local linear models in the high dimensional forward models, particularly at the beginning of learning.
Linear quadratic control is robust even if the underlying linear models are imprecise. We tested the
SSA in simulation and on the real robot. Learning curves are given in Figure 9a and Figure 9b.

Figure 9: Learning curves of devil sticking using the SSA algorithm: (a) simulation results (individual
trials were stopped after 200 hits were reached), (b) real robot results.

Table 1: Comparison of parametric and memory-based approaches

Model                        Training                         Lookup                        Tuning
Nonlinear Parametric Model   Nonlinear Parameter Estimation   Cheap                         ?
Memory-Based Model           Cheap                            Linear Parameter Estimation   Nonlinear Parameter Estimation

The learning curves are typical for the given problem. It takes roughly 40 trials before the
setpoint of each hand has moved close enough to the other hand’s setpoint. For the simulation a
break-through occurs and the robot rarely loses the devilstick after that. The real robot takes more
trials to achieve longer juggling runs, and its performance is less consistent. The devil sticking robot
is a very fast robot, but its positioning accuracy achieves at most ±1 cm. Additionally, the direct
drive motors do not always deliver the torque as commanded. The simulation does not model such
disturbances. It only copes with various levels of Gaussian noise, which is rather well-behaved in
comparison to what the real robot experiences. On average, humans need roughly a week of one
hour practice a day before they learn to juggle the devilstick. With respect to this, the robot learned
rather quickly. Future work will attempt to improve the learning rate and robustness; the results
shown stem from very recent work.


11  Discussion 

Memory-based neural networks are useful for motor learning. Fast training is achieved by
modularizing the network architecture; the memory network does not need to search for weights in order to
store the samples, and local models can be linear in the unknown parameters, leading to a single
optimum which can be found by linear regression or gradient descent. The combination of storing
all the data and only using a certain number of nearby samples to form a local model minimizes
interference between old and new data, and allows the range of generalization to depend on the
density of the samples.

It is useful to compare memory-based function approximation and other nonlinear parametric
modeling approaches (Table 1). Training a memory-based model is computationally inexpensive, as
the data is simply stored in a memory. Training a nonlinear parametric model typically requires an
iterative search for the appropriate parameters. Examples of iterative search are the various gradient
descent techniques used to train neural network models. Lookup or evaluating a memory-based
model is computationally expensive, as described in this paper. Lookup in a nonlinear parametric
model is often relatively inexpensive. If there is a situation in which a fixed set of training data is
available, and there will be many queries to the model after the training data is processed, then it
makes sense to use a nonlinear parametric model. However, if there is a continuous stream of new
training data intermixed with queries, as there typically is in many motor learning problems, it may
be less expensive to train and query a memory-based model than it is to train and query a nonlinear
parametric model.
A potential disadvantage of the memory-based approach is the limited capacity of the memory
network. In this version of the proposed neural network architecture, every experience is stored.
Eventually all the memory units will be used up. We have not yet needed to address this issue in
our experiments. However, we plan to explore how memory use can be minimized based on several
approaches. One approach is to only store “surprises”. The system would try to predict the outputs
of a data point before trying to store it. If the prediction is good, it is not necessary to store the
point. Another approach is to forget data points. Points can be forgotten or removed from the
database based on age, proximity to queries, or other criteria. It is an empirical question as to how
large a memory capacity is necessary for this network design to be useful. Because memory-based
learning retains the original training data, forgetting can be explicitly controlled.
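
The idea of storing only “surprises” can be sketched as follows. To stay short, the sketch predicts with the nearest stored point instead of a local model; that substitution, the error tolerance, and the class name `SurpriseMemory` are all assumptions made for illustration, since the text describes this only as a planned approach.

```python
import numpy as np

class SurpriseMemory:
    """Memory that stores an experience only when it is poorly predicted."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.inputs, self.outputs = [], []

    def predict(self, x):
        """Nearest neighbor prediction (a simple stand-in for the local model)."""
        d = [np.linalg.norm(x - xi) for xi in self.inputs]
        return self.outputs[int(np.argmin(d))]

    def observe(self, x, y):
        """Store (x, y) only if the current memory predicts it poorly."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        if self.inputs and np.linalg.norm(self.predict(x) - y) <= self.tolerance:
            return False          # well predicted, so nothing is stored
        self.inputs.append(x)
        self.outputs.append(y)
        return True
```

Forgetting by age or by proximity to queries could be added in the same spirit by deleting entries from `inputs` and `outputs`.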

The cross validation approach to optimizing the fit parameters reduces the number of arbitrary
choices that need to be made before the training data is collected. However, as with other modeling
approaches, the choice of representation of the data (number and selection of dimensions to be
measured, etc.) plays a large role in determining the success of the approach.

In this learning paradigm the feedback controller serves as the teacher, or source of new data for
the network. If the feedback controller is of poor quality, the nearest neighbor function approximation
method tends to get “stuck” with a non-zero error level. The use of a local model seems to eliminate
this stuck state, and reduce the dependence on the quality of the feedback controller.

Much work remains ahead in developing new learning paradigms. We need to develop learning
systems that maintain multiple levels of models, allowing generalization via abstract models of the
task. We need paradigms that are capable of finding new strategies for a task, and learning and
generalizing across multiple tasks. We look forward to paradigms that perform qualitative physical
reasoning and guide learning using this information. Finally, careful control of exploration is needed
for improvements in learning efficiency.

ABSTRACT

Parti-game is a new algorithm for learning from delayed rewards in high dimensional continuous state-spaces.
In high dimensions it is essential that learning does not explore or plan over state-space uniformly.
Parti-game maintains a decision-tree partitioning of state-space and applies techniques from game-theory and
computational geometry to efficiently and adaptively concentrate high resolution only on critical areas. The
current version of the algorithm is designed to find feasible solutions to high dimensional problems. Future
versions will be designed to find a solution that optimizes a real-valued criterion. Many simulated problems
have been tested, ranging from two-dimensional to nine-dimensional state-spaces, including mazes, path
planning, non-linear dynamics, and planar snake robots in restricted spaces. In all cases, a good solution is
found in less than twenty trials and a few minutes.

1 Reinforcement Learning

Reinforcement learning [Michie and Chambers, 1968, Sutton, 1984, Watkins, 1989, Barto et al.,
1991] is a promising method for robots to program and improve themselves. This paper addresses
one of reinforcement learning’s biggest stumbling blocks: the curse of dimensionality [Bellman,
1957], in which costs increase exponentially with the number of state variables. These costs include
both the computational effort required for planning and the physical amount of data that the
control system must gather.

Much work has been performed with discrete state-spaces: in particular a class of Markov
decision tasks known as grid worlds [Watkins, 1989, Sutton, 1990a]. Most potentially useful
applications of reinforcement learning, however, take place in multidimensional continuous state-spaces.
The obvious way to transform such state-spaces into discrete problems involves quantizing them:
partitioning the state-space into a multidimensional grid, and treating each box within the grid as
an atomic object. Although this can be effective (see, for instance, the pole balancing experiments
of [Michie and Chambers, 1968, Barto et al., 1983] which break state-space up in this way), the
naive grid approach has a number of dangers which will be detailed in this paper.
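
The naive grid quantization can be sketched in a few lines. The function name, the bounds `lower` and `upper`, and the uniform number of cells per dimension are assumptions made for illustration.

```python
import numpy as np

def cell_index(state, lower, upper, cells_per_dim):
    """Map a continuous state to the integer index of its grid box."""
    frac = (np.asarray(state) - lower) / (upper - lower)   # position in [0, 1] per dimension
    idx = np.floor(frac * cells_per_dim).astype(int)
    # States on the upper boundary are assigned to the last box.
    return tuple(int(i) for i in np.clip(idx, 0, cells_per_dim - 1))

def number_of_cells(n_dims, cells_per_dim):
    """Total number of boxes in the grid, which grows exponentially with dimension."""
    return cells_per_dim ** n_dims
```

With ten cells per dimension, a nine-dimensional state-space already contains `number_of_cells(9, 10)` boxes, that is, one billion, which illustrates the cost growth described in the previous section.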

This paper studies in detail the pitfalls of discretization during reinforcement learning and
then introduces the parti-game algorithm. Some earlier work [Simons et al., 1982, Moore, 1991,
Chapman and Kaelbling, 1991, Dayan and Hinton, 1993] considered recursively partitioning
state-space while learning from delayed rewards. The new ideas in the parti-game algorithm include (i) a
game-theoretic splitting criterion to robustly choose spatial resolution, (ii) real-time incremental
maintenance and planning with a database of previous experiences, and (iii) using local greedy
controllers for high-level “funneling” actions.

2  Assumptions 

The  parti-game  algorithm  applies  to  learning  control  problems  in  which: 

1.  State  and  action  spaces  are  continuous  and  multidimensional. 

2. Greedy and hill-climbing techniques can become stuck, never attaining
the goal.

3.  Random  exploration  can  be  intractably  time-consuming. 

4.  The  system  dynamics  and  control  laws  can  have  discontinuities  and 
are  unknown;  they  must  be  learned.  However,  we  do  assume  that  all 
paths that the system can travel through state space are continuous.

The experiments reported later all have properties 1-4. However, the initial algorithm, described
and tested here, has the following restrictions:

5.  Dynamics  are  deterministic. 

6.  The  task  is  specified  by  a  goal  state,  not  an  arbitrary  reward  function. 

7.  The  goal  state  is  known. 

8.  A  feasible  solution  is  found,  not  necessarily  a  path  which  optimizes  a 
particular  criterion. 

9.  A  local  greedy  controller  is  available,  which  we  can  ask  to  move  greedily 
towards  any  desired  state.  There  is  no  guarantee  that  a  request  to  the 
greedy  controller  will  succeed.  For  example,  in  a  maze  a  greedy  path 
to  the  goal  would  quickly  hit  a  wall. 

This paper begins by giving a series of algorithms of increasing sophistication, culminating in
parti-game. We then give results for a number of experimental domains and conclude with discussion of
how constraints 5-9 may be relaxed.

3  The  Parti-game  Algorithm 

The parti-game algorithm learns a controller from a start region to a goal region in a continuous
state-space. We now give four increasingly effective algorithms which attempt to perform this by discrete
partitionings of state-space. Algorithms (1) and (2) are non-learning: they plan a route to the goal
given a-priori knowledge of the world. Algorithms (3) and (4) must learn, and hence explore, the
world while planning a route to the goal. Algorithm (1) is a planner which assumes that state
transitions begin at the center of partitions, and is then generalized to the assumption of starting
randomly within a partition. Algorithm (2) avoids some of (1)’s mistakes by means of worst-case
planning. Algorithm (3) is a learning version of (2). Algorithm (4) is the parti-game algorithm.
It develops a variable resolution partitioning in conjunction with the planning and learning of
Algorithm (3).

3.1  Algorithm  (1):  Non-learning  and  fixed  partitions 

A partitioning of a continuous state-space is a finite set of N disjoint regions labeled 1, 2, ..., N such
that the whole of state space is covered by the union of all partitions. Throughout this paper
we will assume the partitions are all axis-aligned hyperrectangles, though this assumption is not
strictly necessary. It is important to clarify a potential confusion between real-valued states and
partitions. A real-valued state, s, is a real-valued vector in a multidimensional space. For example,
states from the maze depicted in Figure 1 are two-dimensional (x,y) coordinates. A partition is a
discrete entity, and Figure 1 is broken into six partitions with identifiers 1...6. Each real-valued
state is in one partition and each partition contains a continuous set of real-valued states. Define
NEIGHS(i) as the set of partitions which are adjacent to i. In Figure 1, NEIGHS(1) = {2,4}.


Figure  1:  A  two-dimensional  continuous  maze 
with  one  barrier:  the  black  polygonal  region 
near  the  bottom  right.  State-space  has  been 
discretized  into  six  square  partitions. 


Algorithm (1) takes as input an environmental model and a partitioning P. The environmental
model can be any model (for example, dynamic or geometric) which we can use to tell us for any
real-valued state, control command and time interval, what the subsequent real-valued state will be.
The algorithm outputs a policy, a mapping from partition identifiers to the neighboring partitions
which should be aimed for. The algorithm depends upon the NEXT-PARTITION function, which
we define first. NEXT-PARTITION tells us which partition we end up in if we start at a given
real-valued state and keep moving toward the center of a given partition (using a local greedy
controller) until we either exit our initial partition or get stuck. Let i be the partition containing
real-valued state s. Continue applying the local greedy controller “aim at partition j” either until
partition i is exited or until we become permanently stuck in i. Then


$$
\text{NEXT-PARTITION}(s,j) =
\begin{cases}
i & \text{if we became stuck} \\
\text{the partition containing the exit state} & \text{otherwise}
\end{cases}
\tag{1}
$$

The test for sticking can simply be implemented as a test to see if the system has not exited the
partition after a predefined time interval. Depending upon the application other sticking detectors
are possible, such as an obstacle sensor on a mobile robot.
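
As a rough illustration of the definition above, the following Python sketch implements NEXT-PARTITION with the time-limit sticking test. The box representation, the `step` callback (standing in for the environmental model combined with the local greedy controller), the `straight_line_step` example and the `max_steps` limit are all assumptions made for this example; they are not details given in the text.

```python
import math
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Box = Tuple[State, State]  # (lower corner, upper corner) of an axis-aligned hyperrectangle


def center(box: Box) -> State:
    """Real-valued state at the center of a partition."""
    lo, hi = box
    return tuple((a + b) / 2.0 for a, b in zip(lo, hi))


def find_partition(s: State, boxes: Sequence[Box]) -> int:
    """Index of the box containing s (boxes are treated as half-open)."""
    for idx, (lo, hi) in enumerate(boxes):
        if all(a <= x < b for x, a, b in zip(s, lo, hi)):
            return idx
    raise ValueError("state lies outside every partition")


def next_partition(s: State, j: int, boxes: Sequence[Box],
                   step: Callable[[State, State], State],
                   max_steps: int = 1000) -> int:
    """Aim at the center of partition j until partition i is exited or we are stuck.

    `step(s, target)` is a hypothetical one-tick simulation of the local greedy
    controller; "stuck" is detected by the predefined time limit `max_steps`.
    """
    i = find_partition(s, boxes)
    target = center(boxes[j])
    for _ in range(max_steps):
        s = step(s, target)
        now = find_partition(s, boxes)
        if now != i:
            return now  # the partition containing the exit state
    return i  # never left partition i: treated as stuck


def straight_line_step(s: State, target: State, size: float = 0.1) -> State:
    """Obstacle-free example dynamics: move a fixed distance toward the target."""
    d = [t - x for x, t in zip(s, target)]
    norm = math.sqrt(sum(c * c for c in d))
    if norm <= size:
        return tuple(target)
    return tuple(x + size * c / norm for x, c in zip(s, d))
```

With two unit boxes side by side, `next_partition((0.5, 0.5), 1, boxes, straight_line_step)` leaves the first box and reports the second, while a `step` that never moves the state reports the starting partition.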


Algorithm (1) works by constructing a discrete, deterministic Markov decision task (MDT)
[Bellman, 1957, Bertsekas and Tsitsiklis, 1989] in which the discrete MDT states correspond to
partitions. Actions correspond to neighbors thus: action k in partition i corresponds to starting at the
center of partition i and greedily aiming at the center of partition k.

ALGORITHM (1).

1 Given N partitions, construct a deterministic MDT with N discrete states 1, ..., N. The set
of actions of partition i is precisely NEIGHS(i). Define NEXT(i,k) as

$$
\text{NEXT}(i,k) = \text{NEXT-PARTITION}(\text{CENTER}(i), k)
\tag{2}
$$

where CENTER(i) is the real-valued state at the center of partition i.


2 The shortest path to the goal from each partition i, denoted by J_SP(i), is determined by
solving the set of equations:

$$
J_{SP}(i) =
\begin{cases}
0 & \text{if } i = \text{GOAL} \\
1 + \min_{k \in \text{NEIGHS}(i)} J_{SP}(\text{NEXT}(i,k)) & \text{otherwise}
\end{cases}
\tag{3}
$$

The equations are solved by a shortest-path method such as dynamic programming [Bellman,
1957, Bertsekas and Tsitsiklis, 1989] or Dijkstra’s algorithm [Knuth, 1973].


3 The following policy is returned: Always aim for the neighbor with the lowest J_SP() value.

This simple algorithm has immediate drawbacks. It will minimize the number of partitions to the
goal, not the real distance. And the discretization can easily find impossible solutions or fail to
find valid solutions. As an example of the former, in Figure 1, Algorithm (1) would find solution
path 5 → 6 → 3. This is because it is possible to travel from the center of 5 and enter 6 (in the
part of 6 to the left of the obstacle), and it is possible to travel from the center of 6 and enter 3.
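
The planning step of Algorithm (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `neighs` and `nxt` (a precomputed table of NEXT(i,k)) and the four-partition example used afterwards are hypothetical and do not describe Figure 1. Repeated relaxation sweeps stand in for the dynamic programming or Dijkstra solver mentioned in the text.

```python
import math


def shortest_path_costs(neighs, nxt, goal):
    """Solve J(i) = 0 at the goal, else 1 + min over k of J(NEXT(i, k)).

    neighs: dict partition -> list of neighboring partitions (the actions)
    nxt:    dict (i, k) -> partition reached from CENTER(i) when aiming at k
    """
    J = {i: math.inf for i in neighs}
    J[goal] = 0
    changed = True
    while changed:  # repeat relaxation sweeps until nothing improves
        changed = False
        for i in neighs:
            if i == goal:
                continue
            best = 1 + min(J[nxt[(i, k)]] for k in neighs[i])
            if best < J[i]:
                J[i] = best
                changed = True
    return J


def greedy_policy(neighs, nxt, J, goal):
    """Aim for the neighbor whose resulting partition has the lowest cost."""
    return {i: min(neighs[i], key=lambda k: J[nxt[(i, k)]])
            for i in neighs if i != goal}
```

For example, in a made-up 2 x 2 layout where every center-to-center move succeeds except aiming from partition 2 at partition 4 (which leaves the system in 2), the planner routes partition 2 around through partitions 1 and 3.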

An extension to Algorithm (1) might initially appear to solve the problem. We could remove the
assumption that all paths between partitions begin at the center of the source partition. Suppose
we produce a stochastic Markov decision task. Let $p^k_{ij}$ be an approximation of the probability of
transition to partition j given we have started in i and aimed at the center of k. $p^k_{ij}$ is defined
by the probability we end up in partition j from a uniformly randomly chosen legal start point
in partition i. The dynamic programming step of the previous algorithm is altered so that it now
solves the stochastic MDT:

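$$
J_{SP}(i) =
\begin{cases}
0 & \text{if } i = \text{GOAL} \\
1 + \min_{k \in \text{NEIGHS}(i)} \sum_{j} p^k_{ij} \, J_{SP}(j) & \text{otherwise}
\end{cases}
\tag{4}
$$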

Although intuitively appealing, this refinement does not help. In the example of Figure 1 the
resultant policy from state 5 will still be to aim for 6. As we see from Figure 2, $p^6_{56} = 0.65$, and from
Figure 3, $p^3_{63} = 0.91$. The policy 5 → 6 → 3 is interpreted as the transition graph in Figure 4 which
has expected length 1/0.65 + 1/0.91 = 2.64, and so is preferred over the longer but guaranteed
policy of 5 → 4 → 1 → 2 → 3.
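
The expected length quoted above can be checked numerically. The sketch below evaluates the fixed policy 5 → 6 → 3 by iterating the stochastic equation, under the assumption (ours, for illustration) that a failed attempt leaves the system in the partition it started from. The function name and the number of sweeps are hypothetical choices.

```python
def evaluate_chain(p_success, sweeps=200):
    """Expected steps-to-goal along a chain of partitions.

    p_success[m] is the probability that the m-th partition on the chain
    reaches the next one; with probability 1 - p_success[m] we stay put.
    J[-1] is the goal and is fixed at 0.
    """
    n = len(p_success)
    J = [0.0] * (n + 1)
    for _ in range(sweeps):
        for m in range(n - 1, -1, -1):
            p = p_success[m]
            # one backup of J(i) = 1 + sum_j p_ij J(j) for the fixed action
            J[m] = 1.0 + p * J[m + 1] + (1.0 - p) * J[m]
    return J


J = evaluate_chain([0.65, 0.91])  # partitions 5 and 6, then the goal 3
print(round(J[0], 2))  # 2.64, matching 1/0.65 + 1/0.91
```

The guaranteed four-step policy costs 4 by the same measure, which is why the stochastic planner still prefers the unreliable route.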


Figure 2: Approximately 65% of the starting
states (those in the shaded region) in partition
5 are such that they will enter partition 6 if we
aim for the center of 6. Thus $p^6_{56} = 0.65$.


Figure 3: In a similar fashion to Figure 2, $p^3_{63} = 0.91$.


Figure 4: The partition transition probabilities
if we follow the 5 → 6 → 3 policy according to
the assumptions in the text.


Other variants of this stochastic approximation approach are possible, but they all suffer from
the same problem. They are using a Markov decision formalism for something which does not have
the Markov property. This is because from a given partition, i, the neighbors which can successfully
be reached depend on more than “i”; they also depend on the current location within i.


3.2  Algorithm  (2):  Assuming  the  worst  case 

Instead  of  approximating  the  steps-to-goal  value  of  a  partition  by  the  average  steps-to-goal  of  all 
real-valued  states  in  the  partition,  we  approximate  it  by  the  worst  value.  As  before,  each  partition 
has  an  associated  set  of  actions,  each  labeled  by  a  neighboring  partition.  Also,  each  action  now 
has  a  set  of  possible  outcomes.  The  outcomes  of  an  action  j  in  a  partition  i  are  defined  as  the  set 
of  possible  next  partitions. 


$$
\text{OUTCOMES}(i,j) = \left\{ k \;\middle|\; \text{there exists a real-valued state } s \text{ in partition } i \text{ for which } \text{NEXT-PARTITION}(s,j) = k \right\}
\tag{5}
$$


In Figure 5, the actions are denoted by black solid arrows and the outcomes by the thin lines.
For example, partition 5 has three actions: “aim at 4”, “aim at 2” and “aim at 6”. The “aim at
6” action has, in turn, two possible outcomes. We might make it if we are lucky, or else we will
remain in 5. The OUTCOMES() sets are decisions which an imaginary adversary will be allowed
to make, seen in Algorithm (2).


Figure 5: The symbolic representation of the
original problem as partitions (circles), actions
(solid arrows) and outcomes (thin arrows).


ALGORITHM  (2). 

1 Define J_WC(i) as the minimum number of partitions to the goal under the worst-case assumption
that whenever we have specified our current partition i and our intended next partition
j, an adversary is permitted to place us in the worst position within partition i prior to the
local controller being activated.
Solve the following set of minimax equations:

$$
J_{WC}(i) =
\begin{cases}
0 & \text{if } i = \text{GOAL} \\
1 + \min_{k \in \text{NEIGHS}(i)} \; \max_{j \in \text{OUTCOMES}(i,k)} J_{WC}(j) & \text{otherwise}
\end{cases}
\tag{6}
$$

where J_WC(i) is allowed to take the value +∞ to denote a partition from which our adversary
can permanently prevent us reaching the goal. Call such a partition a losing partition.

2 The following policy is returned: Always aim for the neighbor with the lowest J_WC() value.

The J_WC() function can be computed by a standard minimax algorithm [Knuth, 1973], which is
in turn closely related to deterministic dynamic programming algorithms.
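
One simple way to compute the worst-case costs is to start every non-goal partition at infinity and sweep the minimax equation until nothing changes. The Python sketch below does this; it is an illustrative fixed-point iteration rather than the specific minimax procedure cited above, and the `neighs` and `outcomes` dictionaries and the four-partition example are hypothetical.

```python
import math


def worst_case_costs(neighs, outcomes, goal):
    """Solve J(i) = 0 at the goal, else 1 + min_k max_{j in OUTCOMES(i,k)} J(j).

    neighs:   dict partition -> list of neighbors (the available actions)
    outcomes: dict (i, k) -> set of partitions the adversary may put us in
    Partitions whose value stays at infinity are losing partitions.
    """
    J = {i: math.inf for i in neighs}
    J[goal] = 0
    changed = True
    while changed:
        changed = False
        for i in neighs:
            if i == goal:
                continue
            # the adversary picks the worst outcome, we pick the best action
            best = 1 + min(max(J[j] for j in outcomes[(i, k)])
                           for k in neighs[i])
            if best < J[i]:
                J[i] = best
                changed = True
    return J
```

In a made-up 2 x 2 layout where aiming from 2 at 4 may either succeed or leave us in 2, partition 2 is scored by its safe detour; if aiming from 2 at 1 can also leave us in 2, partition 2 becomes a losing partition.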

This algorithm is pessimistic, but if it tells us that the current partition is n partition transitions
to the goal then we can be sure that if we follow its policy we will indeed take n or fewer partition
transitions. The trivial inductive proof is omitted.

In Figure 1, Algorithm (2) will decide that partition 5 is four steps from the goal and will
recommend heading towards 4. It avoids partition 6 because the minimax assumption scores
partition 6 as being ∞ steps from the goal. This is because if, in partition 6, we decide to use
action “aim for 3” the adversary will start us in the bottom left of partition 6. And if we use action
“aim for 5”, the adversary will start us in the bottom right of partition 6.

It  should  be  observed  just  how  pessimistic  the  algorithm  is.  In  the  almost  entirely  empty  maze 
of  Figure  6  the  start  partition  will  be  considered  a  losing  partition.  So  although  the  minimax 
assumption  guarantees  success  if  it  finds  a  solution,  it  may  often  prevent  us  from  solving  easy 
problems.  We  will  see  that  Algorithm  (3)  reduces  the  severity  of  this  problem  because  instead  of 
considering  the  worst  of  all  possible  outcomes,  the  planner  only  considers  the  worst  of  all  empirically 
observed outcomes. Thus a block in a piece of a partition which never was actually visited would
not  be  identified  as  an  outcome  available  to  the  adversary.  Algorithm  (4)  fully  solves  the  remaining 
aspects  of  the  problem  by  increasing  the  resolution  of  losing  partitions. 

3.3  Algorithm  (3):  A  learning  version  of  Algorithm  (2) 

An important aim of this work is to have a controller which does not begin with an environmental
model, but which manages instead to learn purely from experience. Algorithm (2) can be extended
to permit this. The set of OUTCOMES(i, j) for each partition i and neighbor j can be obtained
empirically. Whenever an OUTCOMES(i, j) is altered, the game is solved with the new outcomes
set. We still assume that the location of the partition containing the goal is known.


Figure 6: Partition 3 is scored as losing because
if it aims for 1 the adversary can place it below
the upper triangular block and if it aims for 4
the adversary can place it to the left of the lower
triangular block.


A  further  detail  must  be  resolved.  In  the  early  stages,  what  should  be  done  for  those  actions 
which  have  not  yet  been  experienced?  The  answer  is  to  assume  by  default  that  any  neighbor  aimed 
for can be attained. Algorithm (3), based on these ideas, takes three inputs:

• The current real-valued system state s.

• A partitioning of state-space, P.

• A database, D, of all previously observed partition-transitions in the lifetime of the system.
This is a set of triplets:

(starting in i_0, I aimed for j_0 and actually arrived in k_0)

(starting in i_1, I aimed for j_1 and actually arrived in k_1)

⋮

The algorithm returns two outputs: The final system state after execution and a binary signal
indicating SUCCESS or FAILURE. The database is also updated according to experience.

ALGORITHM (3).

REPEAT FOREVER

1 Compute the OUTCOMES(i, j) set for each partition i and each neighbor j ∈ NEIGHS(i)
thus:

• If, for any k, (i, j, k) ∈ D then:

$$
\text{OUTCOMES}(i,j) = \{ k \mid (i,j,k) \in D \}
\tag{7}
$$

• Else, use the optimistic assumption in the absence of experience:

$$
\text{OUTCOMES}(i,j) = \{ j \}
\tag{8}
$$

2 Compute J_WC() for each partition using minimax.

3 Let i := the partition containing s.

4 If i = GOAL then exit, signaling SUCCESS.

5 If J_WC(i) = ∞ then exit, signaling FAILURE.

6 Else

6.1 Let j := argmin_{j' ∈ NEIGHS(i)} max_{k ∈ OUTCOMES(i, j')} J_WC(k).

6.2 WHILE ( not stuck and s is still in partition i )

6.2.1 Actuate local controller aiming at j.

6.2.2 s := new real-valued state.

6.3 Let i_new := the identifier of the partition containing s.

6.4 D := D ∪ {(i, j, i_new)}

LOOP
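
The loop above can be sketched compactly in Python. To keep the example short, the continuous state is abstracted away: a hypothetical callback `attempt(i, j)` plays the role of steps 6.2 to 6.3 and simply reports the partition reached. That simplification, the function names and the `max_transitions` safety limit are assumptions of this sketch, not part of the algorithm as stated.

```python
import math


def outcomes_from_database(neighs, D):
    """Observed outcomes per (partition, aim); optimistic {j} if never tried."""
    out = {}
    for i, ks in neighs.items():
        for j in ks:
            seen = {k for (a, b, k) in D if a == i and b == j}
            out[(i, j)] = seen if seen else {j}
    return out


def worst_case_costs(neighs, out, goal):
    """Minimax steps-to-goal; infinity marks a losing partition."""
    J = {i: math.inf for i in neighs}
    J[goal] = 0
    changed = True
    while changed:
        changed = False
        for i in neighs:
            if i == goal:
                continue
            best = 1 + min(max(J[k] for k in out[(i, j)]) for j in neighs[i])
            if best < J[i]:
                J[i], changed = best, True
    return J


def algorithm3(start, goal, neighs, D, attempt, max_transitions=1000):
    """Plan with observed outcomes, act, record the transition, repeat.

    D is a set of (start, aimed-for, arrived) triples and is updated in place.
    """
    i = start
    for _ in range(max_transitions):
        out = outcomes_from_database(neighs, D)
        J = worst_case_costs(neighs, out, goal)
        if i == goal:
            return i, "SUCCESS"
        if math.isinf(J[i]):
            return i, "FAILURE"
        # aim for the neighbor with the best worst-case outcome
        j = min(neighs[i], key=lambda k: max(J[m] for m in out[(i, k)]))
        i_new = attempt(i, j)
        D.add((i, j, i_new))
        i = i_new
    raise RuntimeError("transition limit reached")
```

Because untried actions are assumed to succeed, the sketch first tries the direct route, records the failure in `D`, and only then replans around it.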


An addition to the algorithm can reduce the computational load. If real time constraints do not
permit full recomputation of J_WC after an outcome set has changed, then the J_WC updates can take
place incrementally in a series of finite time intervals interleaved with real-time control decisions.
Techniques like this are described in [Sutton, 1990b, Peng and Williams, 1993, Moore and Atkeson,
1993].

The following theorem has not been proved but we expect few difficulties: If a solution exists
from all real-valued states in all partitions, according to Algorithm (2), then Algorithm (3) will,
in fewer than [...] partition transitions, also find a solution from its initial state, where N is the
number of partitions.

A final note about Algorithm (3) is necessary. More general systems than mazes will produce
more interesting games. Later we will see examples of non-uniform partitionings and of dynamics
that produce curved trajectories through space. Both cases can produce more detailed game
structures than stuck/non-stuck transitions, and Algorithm (3) is applicable in these cases too.
Such a structure is shown in Figure 7.

Figure 7: Partitions 1 and 3 are losers because
the adversary can force a permanent loop
between them.


3.4  Algorithm  (4):  Varying  the  Resolution 

We do not wish the system to give up when it discovers it is in a partition for which J_WC = ∞.
The correct interpretation of a losing partition is that the planner needs higher resolution, and
parti-game gives it that by dividing some coarse partitions in two.

Interestingly, it is not necessarily worth increasing the resolution of the partition the system is
in, nor is it necessarily worth splitting all partitions which have J_WC = ∞. Figure 8 shows a case
in which the current state is in a partition which it will not help to split. This is because there is
no path to a non-losing state from the current partition anyway (other losing partitions block us)
and no matter how high we make its resolution our current partition will remain a loser.


Figure 8: A 12-cell partitioning in which
it will not help to split the partition
containing the current state.


It is the partitions on the border between losing regions and non-losing regions which should be
split. Under the assumption that all paths through state space are continuous, and also assuming
that a path to the goal actually exists, there must currently be a hole in one of the border partitions
which has been missed by the over-coarseness. This motivates the final algorithm which we present.
The algorithm takes three inputs:

•  The  current  real-valued  system  state,  s. 

•  A  partitioning  of  state-space,  P. 

• A database, D, of all empirically experienced (start, aimed-for, actual-outcome) triplets.

It  returns  two  outputs:  a  new  partitioning  of  state-space  and  a  new  database. 

ALGORITHM  (4):  (PARTIGAME). 

WHILE ( s not in the goal partition )

1 Compute J_WC() for each partition using minimax.

2 Run Algorithm (3) on s and P, retrieving the resulting additions to the database D, plus the
new real-valued state s, and a success/failure signal.

3 If FAILURE was signaled

3.1 Let Q := All partitions in P for which J_WC = ∞.

3.2 Let Q' := Members of Q who have any non-losing neighbors.

3.3 Let Q'' := Q' and all non-losing neighbors of members of Q'.

3.4 Construct a new set of partitions from Q'', of twice the size, produced by splitting
each partition in half along its longest axis. Call this new set R.

3.5 P := P + R − Q''

3.6 Recompute all new neighbor relations and delete those members of database D which
contain a member of Q'' as a start point, an aim-for or an actual-outcome.

LOOP
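
Step 3 of the algorithm, the resolution increase, might be sketched as follows in Python. The data layout (a dictionary from partition identifier to a pair of corner tuples), the function names, and the way new identifiers are allocated are assumptions of this example. Excluding the goal partition from splitting reflects our reading of the remark that the goal partition never changes; recomputing neighbor relations (step 3.6) is left out for brevity.

```python
import math


def split_longest_axis(box):
    """Split an axis-aligned box (lo, hi) in half along its longest axis."""
    lo, hi = box
    axis = max(range(len(lo)), key=lambda d: hi[d] - lo[d])
    mid = (lo[axis] + hi[axis]) / 2.0
    upper_of_first = hi[:axis] + (mid,) + hi[axis + 1:]
    lower_of_second = lo[:axis] + (mid,) + lo[axis + 1:]
    return (lo, upper_of_first), (lower_of_second, hi)


def partitions_to_split(J, neighs, goal):
    """Compute Q'': border losers plus their non-losing neighbors."""
    losing = {i for i in J if math.isinf(J[i])}                       # Q
    border = {i for i in losing
              if any(not math.isinf(J[n]) for n in neighs[i])}        # Q'
    widened = border | {n for i in border for n in neighs[i]
                        if not math.isinf(J[n])}                      # Q''
    widened.discard(goal)  # assumption: the goal partition is never split
    return widened


def refine(boxes, D, J, neighs, goal):
    """Return the new partitioning P + R - Q'' and the pruned database."""
    q2 = partitions_to_split(J, neighs, goal)
    new_boxes = {i: b for i, b in boxes.items() if i not in q2}
    next_id = max(boxes) + 1
    for i in sorted(q2):
        for half in split_longest_axis(boxes[i]):
            new_boxes[next_id] = half
            next_id += 1
    # forget every experience that mentions a partition which no longer exists
    new_D = {t for t in D if not (set(t) & q2)}
    return new_boxes, new_D
```

Splitting on both sides of the win/lose border, rather than only the losing side, is what the `widened` set implements.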

3.5  Partigame  Details 


Initialization

Before the very first trial, parti-game is initialized as just two partitions: a goal partition covering
the goal region, and one other large partition covering the rest of state-space. At that point
Algorithm (4) is called. Unless the system is very lucky, this trivial partitioning will not be
adequate to reach the goal using the greedy controller. At the point when this is detected the
initial, trivial partitioning will quickly start splitting.

Increasing the resolution

Notice that this algorithm increases the resolution at both sides of the win/lose border. This
prevents enormous partitions from bordering tiny partitions. There could be other algorithms in
[...] open for further investigation.


Planning and learning in parti-game

Parti-game performs planning and learning simultaneously. Interestingly, these two components
are a great help to each other. The learning consists of gathering data to build up the sets of
[...] of transitions between partitions. This data is gathered by planning
[...] “interesting” actions. A partition seems interesting
if [...] tried will work (Equation 8 in
Algorithm (3)), the partition lies on the shortest path to the goal.

The planning is helped by the learning because the computation and representation are
concentrated on the parts of the state space which, according to the database of experiences, are most
[...]

The goal partition

The goal partition is special. It never changes or gets split. [...] defined to be solved when
[...] Section 4 it [...] recompute all the neighbors that it [...]


4  Experiments 

All these experiments are broken into trials. On each trial the system [...] state
and the trial proceeds until the system enters the goal region.

Figure 9: A two-dimensional maze problem. The point
robot must find a path from start to goal without crossing
any of the barrier lines. Remember that initially it does
not know where any obstacles are, and must discover them
by finding impassable states.

Figure 10: The path taken during the entire first trial.
It begins with intense exploration to find a route out of
the almost entirely enclosed start region. Having eventually
reached a sufficiently high resolution, it discovers the
gap and proceeds greedily towards the goal, only to be
stopped by the goal's barrier region. The next barrier is
traversed at a much lower resolution, mainly because the
gap is larger.

4.1  Maze  navigation 

Figure  9  shows  a  two-dimensional  continuous  maze.  Figure  10  shows  the  performance  of  the  robot 
during  the  very  first  trial.  Figure  11  shows  the  second  trial,  started  from  a  slightly  different 
position. The policy derived from the first trial gets us to the goal without further exploration.
The trajectory has unnecessary bends. This is because the controller is discretized according to
the current partitioning. If necessary, a local optimizer could be used to refine this trajectory*.

The system does not explore unnecessary areas. The barrier in the top left remains at low
resolution because the system has had no need to visit there. Figures 12 and 13 show what
happens when we now start the system inside this barrier.

* Another method is to increase the resolution along the trajectory [Moore, 1991].

Figure 11: The second trial.


Figure  12:  Starting  inside  the  top 
left  barrier. 


Figure  13:  The  trial  after  that. 


4.2 Non-linear dynamics

Figure  14  depicts  a  frictionless  puck  on  a  bumpy  surface.  It  can  thrust  left  or  right  with  a  maximum 
thrust  of  ±4  Newtons.  Because  of  gravity,  there  is  a  region  near  the  center  of  the  hill  at  which  the 
maximum  rightward  thrust  is  not  strong  enough  to  accelerate  up  the  slope.  Thus  if  the  goal  is  at 
the  top  of  the  slope,  a  strategy  which  proceeded  by  greedily  choosing  actions  to  thrust  towards  the 
goal  would  get  stuck. 


Figure 14: A frictionless puck
acted on by gravity and a horizontal
thruster. The puck must
get to the goal as quickly as possible.
There are bounds on the
maximum thrust.


This is made clearer in Figure 15, a phase space diagram. The puck’s state has two components,
the position and velocity. The hairs show the next state of the puck if it were to thrust rightwards
with  the  maximum  legal  force  of  4  Newtons.  Notice  that  at  the  center  of  state  space,  even  when 
this  thrust  is  applied,  the  puck  velocity  decreases  and  it  eventually  slides  leftwards.  The  optimal 
solution  for  the  puck  task,  depicted  in  Figure  16,  is  to  initially  thrust  away  from  the  goal,  gaining 
negative  velocity,  until  it  is  on  the  far  left  of  the  diagram.  Then  it  thrusts  hard  right,  to  build  up 
sufficient  energy  to  reach  the  top  of  the  hill. 


Figure 15: The state transition
function for a puck which constantly
thrusts right with maximum thrust.


Figure 16: The “minimum-time”
path from start to goal
for the puck on the hill. The
optimal value function is shown
by the background dots. The
shorter the time to goal, the
larger the black dot. Notice the
discontinuity at the escape velocity.


The local greedy controller which parti-game uses is bang-bang. To aim for a partition “north”
in state space (a partition with greater velocity) it thrusts with the maximum permissible force
of +4 N. To aim for a lower velocity partition it thrusts with −4 N. To aim for an “east” or “west”
partition, the local controller merely controls its velocity (using a trivial linear controller) to be
equal to the velocity of the center of the destination partition. Notice that if the current partition’s
velocity is greater than zero it is hopeless to greedily aim for the partition on the left. It is also
hopeless to aim at the partition on the right if the current partition has negative velocity. In the
experiments below parti-game is given this extra information. Forcing parti-game to learn this
from experience approximately doubles the learning time.
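
A minimal Python sketch of such a local controller is given below. Partitions are taken to be boxes in (position, velocity) space. The `gain` of the linear velocity controller, the clipping of its output to the thrust limit, and the test used in `hopeless` (the whole velocity range of the current box has one sign) are assumptions made for this illustration, since the text does not specify them.

```python
MAX_THRUST = 4.0  # Newtons, the thrust bound given in the text


def local_greedy_thrust(state, current_box, target_box, gain=1.0):
    """Thrust command for aiming from current_box at a neighboring target_box.

    Boxes are ((x_lo, v_lo), (x_hi, v_hi)); state is (x, v).
    """
    _, v = state
    (_, cv_lo), (_, cv_hi) = current_box
    (_, tv_lo), (_, tv_hi) = target_box
    if tv_lo >= cv_hi:   # "north": neighbor with greater velocity
        return +MAX_THRUST
    if tv_hi <= cv_lo:   # "south": neighbor with lower velocity
        return -MAX_THRUST
    # "east"/"west": regulate velocity toward the destination center's velocity
    target_v = (tv_lo + tv_hi) / 2.0
    return max(-MAX_THRUST, min(MAX_THRUST, gain * (target_v - v)))


def hopeless(current_box, target_box):
    """Prior knowledge: cannot move left while moving right, and vice versa."""
    (cx_lo, cv_lo), (cx_hi, cv_hi) = current_box
    (tx_lo, _), (tx_hi, _) = target_box
    aiming_west = tx_hi <= cx_lo
    aiming_east = tx_lo >= cx_hi
    return (aiming_west and cv_lo > 0) or (aiming_east and cv_hi < 0)
```

A planner could use `hopeless` to remove such neighbors from the action set before any experience is gathered, which corresponds to the extra information mentioned above.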

Figure  17  shows  the  trajectory  through  state  space  during  the  very  first  learning  trial,  while  it 
is  exploring  and  developing  its  initial  partitioning.  Figure  18  shows  the  resulting  partitioning  and 
the subsequent trajectory*: on its second trial it has already learned the basic strategy of “begin
by getting a negative velocity, moving backwards, and only then heading forward with full thrust”.
Figure 19 shows the interesting result of running many more trajectories, each starting at random
parts of state space. Many partitions are created and refined, but only around the critical border
in state space which serves as the escape velocity of the problem (also visible as the discontinuity
in  Figure  16).  This  high  resolution  line  arises  not  out  of  any  pre-programmed  knowledge  of  the 
escape  velocity  but  because  the  system  does  not  need  to  increase  the  resolution  of  partitions  which 
fail  to  intersect  the  escape  velocity  region. 

4.3  Higher  dimensional  state  spaces 

Figure  20  shows  a  three-dimensional  state-space  problem.  If  a  standard  grid  were  used,  this  would 
need  an  enormous  number  of  states  because  the  solution  requires  detailed  maneuvers.  Parti-game’s 
total  exploration  took  18  times  as  much  movement  as  one  run  of  the  final  path  obtained. 

Figure 21 shows a four-dimensional problem in which a ball slides on a tray with steep edges.
The goal is on the other side of a ridge. The maximum permissible force is low. Greedy strategies,
or globally linear control rules, get stuck in limit cycles within a valley. The local greedy controller
used to navigate between adjacent partitions is a bang-bang controller. Parti-game’s solution runs to the
far end of the tray, to build up enough velocity to make it over the ridge. The exploration-length
versus final-path-length ratio is 24.

Figure 22 shows a 9-joint snake-like robot manipulator which must move to a specified
configuration on the other side of a barrier. Again, no kinematics model or knowledge of obstacle locations
are given: the system must learn these as it explores. It takes seven trials before converging on
the solution shown, which requires about two minutes run-time on a SPARC-I workstation. The
exploration-length versus final-path-length ratio is 60. Interestingly, the final number of partitions
is only 85. This compares very favorably with the 512 partitions which would be needed if the
coarsest non-trivial uniform grid were used: 2 × 2 × ⋯ × 2. Unsurprisingly, for the 9-joint snake,
this 512 uniform grid is too coarse, and in experiments we performed with such a grid the system
became stuck, eventually deciding the problem was impossible.

* Careful inspection of this diagram reveals that the trajectory changes direction not at the borders of cells but
instead within cells. This is because the current implementation waits until it is well within a cell before applying
the cell’s recommended action.

Figure 17: The trajectory of the
very first trial, while the system
performed its initial exploration of state
space.


Figure 18: The trajectory and
partitioning of the second trial.


Figure 19: The partitioning after it
has learned the task from 200
random start positions.

5  Related  work 

A  few  other  researchers  in  Reinforcement  Learning  have  attempted  to  overcome  dimensionality 
problems by decompositions of state space. An early attempt was [Simons et al., 1982] who
attempted it for 3-degree-of-freedom force control. Their method gradually learned by recording
cumulative  statistics  of  performance  in  partitions.  More  recently,  we  produced  a  variable  resolution 
dynamic  programming  method  [Moore,  1991].  This  enabled  conventional  dynamic  programming  to 
be  performed  in  real  valued  multivariate  state  spaces  where  straightforward  discretization  would 
fall  prey  to  the  curse  of  dimensionality.  This  is  another  approach  to  partitioning  state  space  but 
has the drawback that, unlike parti-game, it requires a guess at an initially valid trajectory through
state space. [Chapman and Kaelbling, 1991] proposed an interesting algorithm which used more
sophisticated statistics to decide which attributes to split. Their objectives were very hard because
they wished to avoid remembering transitions between cells and they did not assume continuous
paths through state space, and so they obtained only limited empirical success.

Figure 20: A problem with a planar rod being guided past obstacles. The state-space is three-dimensional:
two values specify the position of the rod’s center, and the third specifies the rod’s
angle from the horizontal. The angle is constrained so that the pole’s dotted end must always be
below the other end. The pole’s center may be moved a short distance (up to 1/40 of the diagram
width) and its angle may be altered by up to 5 degrees, provided it does not hit a barrier in the
process. Parti-game converged to the path shown below after two trials, with 18 times as many
exploration steps as there are steps in the final path. The partitioning lines on the second diagram
only show a two-dimensional slice of the full partitioning.

Figure 21: A puck sliding over a hilly surface (hills shown by contours below: the surface is bowl
shaped, with the start and goal states at the bottoms of distinct valleys). The state-space is four-dimensional:
two position and two velocity variables. The controls consist of a force which may be
applied in any direction, but with bounded magnitude. Convergence time was two trials, with 24
times as much exploration as there are steps in the final path.

Figure 22: A nine-degree-of-freedom planar robot must move from the shown start configuration
to the goal. The joints are shown by small circles on the left-hand diagram which depicts two
configurations of the arm: the start position and the goal position. The solution entails curling,
rotating and then uncurling. It may not intersect with any of the barriers, the edge of the workspace,
or itself. Convergence occurred after seven trials, with 60 times as much exploration as there are
steps in the final path.

In  [Dayan  and  Hinton,  1993]  a  2-dimensional  hierarchical  partitioning  was  used  on  a  grid  with 
64  discrete  squares,  and  [Kaelbling,  1993]  gives  another  hierarchical  algorithm.  These  references 
both  attempt  a  different  goal  than  parti-game:  they  try  to  accelerate  Q-learning  [Watkins,  1989] 
by  providing  it  with  a  pre-programmed  abstraction  of  the  world.  The  abstraction,  it  is  noted  in 
both  cases,  may  sometimes  indeed  lead  to  faster  learning  and  can  improve  Q-learning  if  there  are 
multiple  goals  in  the  problem.  In  contrast,  parti-game  is  able  to  build  its  own  abstraction  using 
geometric  reasoning  and  so  learns  more  quickly  (typically  in  fewer  than  ten  trials  and  a  few  minutes 
of  real  time)  and  on  significantly  higher  dimensional  problems  than  have  been  attempted  elsewhere. 
The  price  parti-game  pays  is  that  it  is  limited  to  geometric  abstractions,  whereas  both  Kaelbling’s 
and  Dayan’s  methods  may  eventually  be  applicable  to  other  abstraction  hierarchies. 

Geometric Decompositions have also been used fairly extensively in Robot Motion Planning
(e.g. [Brooks and Lozano-Perez, 1983, Kambhampati and Davis, 1986]), summarized in [Latombe,
1991]. The principal difference is that the Robot Motion Planning methods all assume that a model
of  the  environment  (typically  in  the  form  of  a  pre-programmed  list  of  polygons)  is  supplied  to  the 
system  in  advance  so  that  there  is  no  learning  or  exploration  capability.  The  experiments  in  [Brooks 
and  Lozano-Perez,  1983]  involve  a  3-degree-of-freedom  navigation  problem  and  in  [Kambhampati 
and  Davis,  1986],  a  fairly  difficult  2-dimensional  maze. 

Finally,  some  relation  can  be  seen  between  parti-game  and  multigrid  methods  (e.g.  [Hoppe, 
1986])  used  in  numerical  analysis  to  accelerate  the  convergence  of  solutions  to  partial  differential 
equations.  Multigrid  methods  typically  increase  the  resolution  of  the  grid  everywhere,  and  like 
robot  motion  planning,  do  not  learn:  the  correct  system  dynamics  must  be  programmed  in. 


6  Discussion 

6.1  Splitting. 

Given  a  partition  we  have  decided  to  split,  which  axis  should  be  split?  Algorithm  (4)  states  that 
we  should  split  along  the  longest  axis.  This  begs  the  question  of  where  to  split  in  the  case  of  ties. 
The  current  algorithm  resolves  ties  with  a  fixed  ordering  on  axes,  but  we  could  be  cleverer.  In 
Figure  23  it  is  clear  that  a  vertical  split  would  be  more  useful  than  a  horizontal  split.  This  kind  of 
intelligent split choice, which pays attention to the locations of outcomes, would not be difficult to
incorporate.

Figure 23: We have had two experiences of attempting to move North from two different
points in the partition. Only one succeeded.
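The current rule (longest axis, with ties resolved by a fixed ordering on axes) can be sketched as follows. This is only an illustration: the names `choose_split_axis` and `split_partition`, the representation of a partition by its lower and upper corners, and the choice to split at the midpoint are assumptions made for the example, not details taken from Algorithm (4).

```python
def choose_split_axis(lower, upper):
    """Return the index of the longest axis of an axis-aligned partition.

    Ties are resolved by a fixed ordering on axes: the lowest index wins,
    because list.index returns the first matching position.
    """
    lengths = [hi - lo for lo, hi in zip(lower, upper)]
    return lengths.index(max(lengths))


def split_partition(lower, upper):
    """Split a partition along its longest axis (assumed here: at the midpoint).

    Returns two (lower, upper) corner pairs, one for each child.
    """
    axis = choose_split_axis(lower, upper)
    mid = (lower[axis] + upper[axis]) / 2.0
    first_upper = list(upper)
    first_upper[axis] = mid
    second_lower = list(lower)
    second_lower[axis] = mid
    return (list(lower), first_upper), (second_lower, list(upper))
```

An outcome-aware rule of the kind suggested by Figure 23 would replace `choose_split_axis` with a function that also inspects the recorded outcomes inside the partition.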


6.2  Not  forgetting 

When a partition is split, the outcomes of its children are initialized to be empty and so old data
is  forgotten.  This  is  not  desirable  or  necessary.  Old  trajectories  could  be  retained  and  used  to 
initialize  the  OUTCOMES()  sets  of  those  children  within  which  earlier  trajectory  segments  lay. 

6.3  Learning  the  local  greedy  controllers 

The  parti-game  algorithm  requires  that  the  user  defines  local  greedy  controllers.  Is  this  not  a  large 
sacrifice  of  autonomy?  We  argue  not:  learning  greedy  controllers  merely  requires  gathering  enough 
local  experience  to  form  a  local  linear  map  of  the  low  level  system  dynamics.  This  can  be  done 
with  relative  ease,  both  in  a  statistical  and  computational  sense. 

6.4  Dealing  with  an  unknown  goal  state 

There  is  no  difficulty  for  parti-game  in  removing  the  assumption  that  the  location  of  the  goal  state 
is  known.  Convergence  will  be  considerably  slowed  down  if  it  is  not  given,  but  this  is  not  the  fault 
of the algorithm. If there are $D$ state variables and the goal is signaled when all state variables
are simultaneously within $\pm\delta\%$ of an unknown goal value, then it is clear that an exploration of at
least

points on a grid in state-space are needed to ensure the goal is reached even once, whatever the
learning algorithm.

A simple supplement to parti-game can be made to implement this kind of uniform exploration.
It begins with a uniform grid partition with

$$\frac{100}{\delta} \qquad (10)$$

breaks on each axis and encourages exploration by estimating the $J_{WC}$ value of all unvisited
partitions as zero. At this resolution at least one initial partition must be a proper subset of the
goal region and so once the system has entered any part of each initial partition the goal must have
been discovered.
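The growth of this uniform starting grid with dimensionality can be illustrated with a few lines of code. This is a sketch under stated assumptions: the tolerance is read as $\delta$ percent of each axis range, the grid has $100/\delta$ cells per axis as in Equation (10), and the function names are hypothetical.

```python
import math


def breaks_per_axis(delta_percent):
    """Cells per axis so that each cell spans at most delta_percent of the axis range."""
    return math.ceil(100.0 / delta_percent)


def initial_partition_count(delta_percent, num_state_variables):
    """Total cells in the uniform starting grid: one factor per state variable."""
    return breaks_per_axis(delta_percent) ** num_state_variables
```

For example, a tolerance of 5% on a four-dimensional problem already gives 20 breaks per axis and 160000 initial partitions, which is why convergence is considerably slower when the goal location is not given.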

6.5  Attaining  Optimality 

Parti-game is designed to find solutions to delayed reward control problems in reasonable time
without needing help in the form of initial human-supplied trajectories. The algorithm works hard
to find a solution but makes no attempt to optimize it. Empirically, all solutions found have been
good. There are a number of kinds of suboptimality which parti-game will not produce. In the case
of navigation, for example, parti-game cannot produce loops or meanders, as shown in Figure 24.

Figure 24: Parti-game cannot produce either of these kinds of suboptimality. No loops, and no
unblocked adjacent partitions which contain separate parts of a solution trajectory.

The  lack  of  guaranteed  optimality  in  parti-game  is  a  concession  to  the  fact  that  there  is  unlikely 
to  be  sufficient  time  in  the  lifetime  of  a  reinforcement  learning  system  to  explore  every  possible 
solution. Future research may reveal ways to achieve weaker optimality guarantees:

•  That  the  solution  is  locally  optimal. 

•  A  proof  that  even  if  the  solution  is  not  globally  optimal,  the  global  solution  can  be  no  better 
than  factor  K  (in  terms  of  cost  units  for  the  task  being  learned)  over  parti-game’s  solution. 

Both these optimality statements will require extra assumptions about the state space. In the case
of navigational problems, this can come in the form of geometric reasoning. In dynamics problems
it will be by means of local linearizations within partitions, and subsequent Linear Quadratic
Gaussian (LQG) local control design (see, for example, [Sage and White, 1977]).

It is also possible that other domains will be able to use similar reasoning by means of admissible
heuristics: a classical method in AI for formally reasoning about the optimality of proposed
solutions [Nilsson, 1971].

6.6  Multiple  Goals 

Because  it  builds  an  explicit  model  of  all  the  possible  state  transitions  between  partitions,  it  is  a 
trivial  matter  for  parti-game  to  change  to  a  new  goal.  We  have  performed  a  number  of  experiments 
(not  reported  here)  that  confirm  this. 

6.7  Stochastic  Dynamics 

This is the hardest issue for parti-game to cope with. If a given action in a given partition produces
multiple results, how do we decide if this is due to inherent randomness or due to overly coarse
partitions? In the latter case it will be helpful to increase the resolution and in the former case it
will not.

The  easiest  case  will  be  noise  in  the  form  of 

$$\text{next-state} = f(\text{state}, \text{action}) + \text{noise-signal}() \qquad (11)$$

An  example  is  an  environment  which  randomly  jogs  a  mobile  robot  between  each  movement.  We 
have  performed  some  experiments  with  parti-game  under  this  scenario  (not  reported  here),  and 
have not yet seen it get stuck even when quite substantial noise was added. In principle, though,
any amount of noise could break the parti-game algorithm: if trials were run indefinitely, eventually
all of state space would become partitioned to unboundedly high resolution. An improvement to
parti-game  might  use  statistical  tests  which  try  to  explain  outcomes  in  terms  of  location  within  the 
partition.  This  might  help,  but  further  research  is  needed. 

If the randomness is something which occasionally teleports the system to a random place
(breaking the assumption of paths being continuous through state space), then parti-game would
probably need an entirely different splitting criterion. One possibility is a version of the “G”
splitting rule of [Chapman and Kaelbling, 1991].

6.8  The  Curse  of  Dimensionality 

We  finish  by  noting  a  promising  sign  involving  a  series  of  snake  robot  experiments  with  different 
numbers of links (but fixed total length). Intuitively, the problem should get easier with more links,
but the curse of dimensionality would mean that (in the absence of prior knowledge) it becomes
exponentially harder. This is borne out by the observation that random exploration with the
three-link arm will stumble on the goal eventually, whereas the nine-link robot cannot be expected
to do so in tractable time. However, Figure 25 indicates that as the dimensionality rises, the
amount  of  exploration  (and  hence  computation)  used  by  parti-game  does  not  rise  exponentially. 
It  is  conceivable  (but  not  supported  by  further  evidence  in  this  paper)  that  real-world  tasks  may 
often  have  the  same  property:  the  complexity  of  the  ultimate  task  remains  roughly  constant  as 
the  number  of  degrees  of  freedom  increases.  If  so,  this  might  be  the  Achilles’  heel  of  the  curse  of 
dimensionality. 


Figure 25: The number of partitions finally created against degrees of freedom for a set of
snake-like robots. The partitionings built were all highly non-uniform, typically having maximum
depth nodes of twice the dimensionality. The relation between exploration time and dimensionality
(not shown) had a similar shape.


7  Conclusion 

This  paper  began  with  the  problems  of  coarse  partitionings  of  state  space.  It  then  showed  how 
worst-case assumptions can solve these problems, and very effectively identify partitions which need
to have their resolutions increased. There are many interesting avenues arising from these ideas
which remain open for further investigation.

Abstract

We present a new algorithm, Prioritized Sweeping, for efficient prediction and control of stochastic
Markov systems. Incremental learning methods such as Temporal Differencing and Q-learning
have fast real time performance. Classical methods are slower, but more accurate,
because they make full use of the observations. Prioritized Sweeping aims for the best of both
worlds. It uses all previous experiences both to prioritize important dynamic programming
sweeps and to guide the exploration of state-space. We compare Prioritized Sweeping with
other reinforcement learning schemes for a number of different stochastic optimal control problems.
It successfully solves large state-space real time problems with which other methods have
difficulty.

1 Introduction


This  paper  introduces  a  memory-based  technique,  prioritized  sweeping,  which  can  be  used  both  for 
Markov  prediction  and  reinforcement  learning.  Current,  model-free,  learning  algorithms  perform 
well  relative  to  real  time.  Classical  methods  such  as  matrix  inversion  and  dynamic  programming 
perform  well  relative  to  the  number  of  observations.  Prioritized  sweeping  seeks  to  achieve  the  best 
of  both  worlds.  Its  closest  relation  from  conventional  AI  is  the  search  scheduling  technique  of  the 
A* algorithm (Nilsson 1971). It is a “memory-based” method (Stanfill and Waltz 1986) in that it
derives much of its power from explicitly remembering all real-world experiences. Closely related
research is being performed by Peng and Williams (1992) into a similar algorithm to prioritized
sweeping, which they call Dyna-Q-queue.

We begin by providing a review of the problems and techniques in Markov prediction and
control.  More  thorough  reviews  may  be  found  in  Sutton  (1988),  Barto  et  al.  (1989),  Sutton  (1990), 
Kaelbling  (1990)  and  Barto  et  al.  (1991). 

A  discrete,  finite  Markov  system  has  S  states.  Time  passes  as  a  series  of  discrete  clock  ticks, 
and  on  each  tick  the  state  may  change.  The  probability  of  possible  successor  states  is  a  function 
only  of  the  current  system  state.  The  entire  system  can  thus  be  specified  by  S  and  a  table  of 
transition  probabilities. 

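$$
\begin{pmatrix}
q_{11} & q_{12} & \cdots & q_{1S} \\
q_{21} & q_{22} & \cdots & q_{2S} \\
\vdots & \vdots & & \vdots \\
q_{S1} & q_{S2} & \cdots & q_{SS}
\end{pmatrix}
\qquad (1)
$$

where $q_{ij}$ denotes the probability that, given we are in state $i$, we will be in state $j$ on the next
time step. The table must satisfy $\sum_{j=1}^{S} q_{ij} = 1$ for every $i$.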

Figure  1  shows  an  example  with  six  states  corresponding  to  the  six  cells.  With  the  exception  of 
the  rightmost  states,  on  each  time  step  the  system  moves  at  random  to  a  neighbor.  For  example, 
state 1 moves directly to state 3 with probability $\frac{1}{2}$, and thus $q_{13} = \frac{1}{2}$.

The state-space of a Markov system is partitioned into two subsets: the non-terminal states
NONTERMS, and the terminal states TERMS. Once a terminal state is entered, it is never left
($k \in \text{TERMS} \Rightarrow q_{kk} = 1$). In the example, the two rightmost states are terminal.

A  Markov  system  is  defined  as  absorbing  if  from  every  non-terminal  state  it  is  possible  to 
eventually  enter  a  terminal  state.  We  restrict  our  attention  to  absorbing  Markov  systems. 

Let us first consider questions such as “starting in state $i$, what is the probability of eventual
absorption by terminal state $k$?” Write this value as $\pi_{ik}$. All the absorption probabilities for
terminal state $k$ can be computed by solving the following set of linear equations. Assume that the
non-terminal states are indexed by $1, 2, \ldots, S_{nt}$, where $S_{nt}$ is the number of non-terminals.

$$
\begin{aligned}
\pi_{1k} &= q_{1k} + q_{11}\pi_{1k} + q_{12}\pi_{2k} + \cdots + q_{1S_{nt}}\pi_{S_{nt}k} \\
\pi_{2k} &= q_{2k} + q_{21}\pi_{1k} + q_{22}\pi_{2k} + \cdots + q_{2S_{nt}}\pi_{S_{nt}k} \\
&\;\;\vdots \\
\pi_{S_{nt}k} &= q_{S_{nt}k} + q_{S_{nt}1}\pi_{1k} + q_{S_{nt}2}\pi_{2k} + \cdots + q_{S_{nt}S_{nt}}\pi_{S_{nt}k}
\end{aligned}
\qquad (2)
$$
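As a worked instance of this linear system, the sketch below solves it exactly for a six-state example. The transition table is an assumption: it is inferred from the description of Figure 1 (each non-terminal state moves at random to a neighbor, states 5 and 6 are terminal) and from the neighbor pairs that appear in the example trials, not copied from the figure itself. The helper names are hypothetical, and plain Gauss-Jordan elimination is used only because it keeps the example self-contained.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def absorption_probabilities(q, nonterms, k):
    """Return {i: pi_ik} for terminal state k by solving (I - Q) pi = q_k."""
    zero = Fraction(0)
    a = [[(1 if i == j else 0) - q.get((i, j), zero) for j in nonterms]
         for i in nonterms]
    b = [q.get((i, k), zero) for i in nonterms]
    return dict(zip(nonterms, solve_linear_system(a, b)))


# Assumed six-state layout: 1-2, 1-3, 2-4, 3-4, 3-5, 4-6 are neighbors.
half, third = Fraction(1, 2), Fraction(1, 3)
q = {(1, 2): half, (1, 3): half,
     (2, 1): half, (2, 4): half,
     (3, 1): third, (3, 4): third, (3, 5): third,
     (4, 2): third, (4, 3): third, (4, 6): third}
pi_5 = absorption_probabilities(q, [1, 2, 3, 4], 5)
```

Under these assumed transitions the probabilities of eventual absorption in state 5 are 6/11, 5/11, 7/11 and 4/11 for states 1 to 4 respectively.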

When the transition probabilities $\{q_{ij}\}$ are known it is thus an easy matter to compute the eventual
absorption probabilities. Machine learning can be applied to the case in which the transition
probabilities are not known in advance, and all we may do instead is watch a series of state
transitions. Such a series is normally arranged into a set of trials: each trial starts in some state
and then continues until the system enters a terminal state. In our example, the learner might be
shown

$$
\begin{array}{l}
3 \to 4 \to 3 \to 1 \to 2 \to 4 \to 6 \\
3 \to 5 \\
1 \to 2 \to 1 \to 3 \to 5
\end{array}
\qquad (3)
$$

Learning  approaches  to  this  problem  have  been  widely  studied.  A  recent  contribution  of  great 
relevance is an elegant algorithm called Temporal Differencing (Sutton 1988).


1.1  The  Temporal  Differencing  algorithm  reviewed 


We describe the discrete state-space case of the temporal differencing algorithm. TD can, however,
also be applied to systems with continuous state-spaces in which long term probabilities are
represented  by  parametric  function  approximators  such  as  neural  networks  (Tesauro  1991). 

The  prediction  process  runs  in  a  series  of  epochs.  Each  epoch  ends  when  a  terminal  state  is 
entered. Assume we have passed through states $i_1, i_2, \ldots, i_n, i_{n+1}$ so far in the current epoch. $n$ is
our age within the epoch and $t$ is our global age. $i_n \to i_{n+1}$ is the most recently observed transition.
Let $\hat\pi_{ik}[t]$ be the estimated value of $\pi_{ik}$ after the system has been running for $t$ state transition
observations.  Then  the  TD  algorithm  for  discrete  state-spaces  updates  these  estimates  according 
to  the  following  rule: 


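for each $i \in$ NONTERMS (the set of non-terminal states)
    for each $k \in$ TERMS (the set of terminal states)

$$
\hat\pi_{ik}[t+1] = \hat\pi_{ik}[t] + \alpha \left( \hat\pi_{i_{n+1}k}[t] - \hat\pi_{i_n k}[t] \right) \sum_{j=1}^{n} \lambda^{n-j} X_i(i_j) \qquad (4)
$$

where $\alpha$ is a learning rate parameter $0 < \alpha < 1$, where $\lambda$ is a memory constant $0 < \lambda < 1$ and
where

$$
X_i(i_j) = \begin{cases} 1 & \text{if } i_j = i \\ 0 & \text{otherwise} \end{cases} \qquad (5)
$$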

In practice there is a computational trick which requires considerably less computation than the
algorithm of Equation (4) but which computes the same values (Sutton 1988). The TD algorithm
then requires $O(S_t)$ computation steps per real observation, where $S_t$ is the number of terminal
states. Convergence proofs exist for several formulations of the TD algorithm (Sutton 1988;
Dayan 1992).
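One common way to avoid recomputing the sum in Equation (4) is to keep a running eligibility trace per state, decayed by $\lambda$ on every step. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, estimates start at zero, traces are reset at the start of each epoch, and a terminal successor $i_{n+1}$ is given the value 1 for $k = i_{n+1}$ and 0 otherwise. It is an illustration of the update rule, not a reconstruction of the implementation used in the experiments.

```python
def td_update(pi, trace, i_n, i_next, terms, alpha, lam):
    """Apply one TD(lambda) step for the observed transition i_n -> i_next."""
    # Decay old eligibilities, then mark the state just left.
    # After this, trace[i] equals the sum over j of lam**(n - j) * X_i(i_j).
    for i in trace:
        trace[i] *= lam
    trace[i_n] = trace.get(i_n, 0.0) + 1.0
    # Temporal-difference errors use only the estimates from before this step.
    errors = {}
    for k in terms:
        if i_next in terms:
            target = 1.0 if i_next == k else 0.0  # assumed value of a terminal state
        else:
            target = pi[i_next][k]
        errors[k] = target - pi[i_n][k]
    for i, eligibility in trace.items():
        for k in terms:
            pi[i][k] += alpha * errors[k] * eligibility


def td_learn(trials, nonterms, terms, alpha=0.05, lam=0.25):
    """Run TD(lambda) over a list of trials; each trial is a list of visited states."""
    pi = {i: {k: 0.0 for k in terms} for i in nonterms}
    for trial in trials:
        trace = {}  # eligibilities are reset at the start of each epoch
        for i_n, i_next in zip(trial, trial[1:]):
            td_update(pi, trace, i_n, i_next, terms, alpha, lam)
    return pi
```

For instance, `td_learn([[1, 3, 5]], [1, 2, 3, 4], [5, 6], alpha=0.5, lam=0.5)` moves the estimate for state 3 halfway towards absorption in state 5, and passes half of that change back to state 1 through the trace.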

1.2  The  classical  approach 

The  classical  method  proceeds  by  building  a  maximum  likelihood  model  of  the  state  transitions. 
$q_{ij}$ is estimated by

$$
\hat q_{ij} = \frac{\text{Number of observations } i \to j}{\text{Number of occasions in state } i} \qquad (6)
$$
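The estimate of Equation (6) amounts to counting transitions. A minimal sketch, assuming each trial is given as a list of visited states and using the three example trials as read from (3); the function name is hypothetical:

```python
from collections import Counter
from fractions import Fraction


def estimate_transitions(trials):
    """Return {(i, j): q_hat_ij}, the observed fraction of departures from i that went to j."""
    pair_counts = Counter()
    state_counts = Counter()
    for trial in trials:
        for i, j in zip(trial, trial[1:]):
            pair_counts[(i, j)] += 1   # number of observations i -> j
            state_counts[i] += 1       # number of occasions in state i
    return {(i, j): Fraction(count, state_counts[i])
            for (i, j), count in pair_counts.items()}


trials = [[3, 4, 3, 1, 2, 4, 6], [3, 5], [1, 2, 1, 3, 5]]
q_hat = estimate_transitions(trials)
```

With only three trials the estimates are still far from the true values: state 1 was left three times, twice towards state 2, so the estimate for the transition from 1 to 3 is 1/3.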

After $t + 1$ observations the new absorption probability estimates are computed to satisfy, for each
terminal state $k$, the $S_{nt} \times S_{nt}$ linear system

$$
\hat\pi_{ik}[t+1] = \hat q_{ik} + \sum_{j \in \text{succs}(i) \cap \text{NONTERMS}} \hat q_{ij}\, \hat\pi_{jk}[t+1] \qquad (7)
$$

where succs($i$) is the set of all states which have been observed as immediate successors of $i$ and
NONTERMS is the set of non-terminal states. It is clear that if the $\hat q_{ij}$ estimates were correct then
the solution of Equation (7) would be the solution of Equation (2).

Notice that the values $\hat\pi_{ik}[t+1]$ depend only on the values of $\hat q_{ij}$ after $t + 1$ observations; they
are not defined in terms of the previous absorption probability estimates $\hat\pi_{ik}[t]$. However, it is
efficient to solve Equation (7) iteratively. Let $\{\rho_{ik}\}$ be a set of intermediate iteration variables
containing intermediate estimates of $\hat\pi_{ik}[t+1]$. What initial estimates should be used to start the
iteration? An excellent answer is to use the previous absorption probability estimates $\hat\pi_{ik}[t]$.

The  complete  algorithm,  performed  once  after  every  real-world  observation,  is  shown  in  Figure  2. 
The transformation on the $\rho_{ik}$’s can be shown to be a contraction mapping as defined in Section 3.1
of Bertsekas and Tsitsiklis (1989), and thus, as the same reference proves, convergence to a solution
satisfying Equation (7) is guaranteed. If, according to the estimated transitions, all states can reach
a terminal state, then this solution is unique. The inner loop (“for each $k \in$ TERMS ...”) is referred
to as a probability backup operation, and requires $O(S_t \mu_{succs})$ basic operations, where $\mu_{succs}$ is the
mean number of observed stochastic successors.

1. for each $i \in$ NONTERMS, for each $k \in$ TERMS,
       $\rho_{ik} := \hat\pi_{ik}[t]$
2. repeat
   2.1 $\Delta_{max} := 0$
   2.2 for each $i \in$ NONTERMS
           for each $k \in$ TERMS
               $\rho_{new} := \hat q_{ik} + \sum_{j \in \text{succs}(i) \cap \text{NONTERMS}} \hat q_{ij}\, \rho_{jk}$
               $\Delta := |\rho_{new} - \rho_{ik}|$
               $\rho_{ik} := \rho_{new}$
               $\Delta_{max} := \max(\Delta_{max}, \Delta)$
   until $\Delta_{max} < \epsilon$
3. for each $i \in$ NONTERMS, for each $k \in$ TERMS
       $\hat\pi_{ik}[t+1] := \rho_{ik}$

Figure 2: Stochastic prediction with full Gauss-Seidel iteration.

Gauss-Seidel is an expensive algorithm, requiring $O(S_{nt})$ backups per real-world observation for
the inner loop 2.2 alone. The absorption predictions before the most recent observation, $\hat\pi_{ik}[t]$, normally
provide an excellent initial approximation, and only a very few iterations are required. However,
when an “interesting” observation is encountered, for example a previously never-experienced
transition to a terminal state, many iterations, perhaps more than $S_{nt}$, are needed for convergence.
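The iteration of Figure 2 can be sketched in a few lines. The function name, the dictionary representation of the transition estimates, and the six-state transition table used to exercise it are assumptions made for illustration (the table is the same inferred layout of the six-cell example, with states 5 and 6 terminal); the stopping threshold `eps` is likewise an arbitrary choice.

```python
def gauss_seidel(q_hat, nonterms, terms, pi_prev, eps=1e-9):
    """Iterate the probability backup of Figure 2 until the largest change is below eps.

    q_hat maps (i, j) to the estimated transition probability from i to j.
    pi_prev[i][k] holds the previous absorption estimates, used as the starting point.
    """
    succs = {i: [] for i in nonterms}
    for (i, j) in q_hat:
        if i in succs:
            succs[i].append(j)
    # Step 1: start from the previous estimates.
    rho = {i: dict(pi_prev[i]) for i in nonterms}
    # Step 2: sweep over all non-terminal states until nothing changes much.
    while True:
        delta_max = 0.0
        for i in nonterms:
            for k in terms:
                rho_new = q_hat.get((i, k), 0.0) + sum(
                    q_hat[(i, j)] * rho[j][k] for j in succs[i] if j in rho)
                delta_max = max(delta_max, abs(rho_new - rho[i][k]))
                rho[i][k] = rho_new  # updated in place, so later backups see it
        if delta_max < eps:
            return rho  # Step 3: these become the new estimates


# Assumed six-state example; states 5 and 6 are terminal.
q = {(1, 2): 0.5, (1, 3): 0.5, (2, 1): 0.5, (2, 4): 0.5,
     (3, 1): 1 / 3, (3, 4): 1 / 3, (3, 5): 1 / 3,
     (4, 2): 1 / 3, (4, 3): 1 / 3, (4, 6): 1 / 3}
zeros = {i: {5: 0.0, 6: 0.0} for i in [1, 2, 3, 4]}
pi_hat = gauss_seidel(q, [1, 2, 3, 4], [5, 6], zeros)
```

Starting from all-zero estimates is the worst case described above; in normal operation `pi_prev` would be the estimates from the previous observation and far fewer sweeps would be needed.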

2  Prioritized  Sweeping 

Prioritized  sweeping  is  designed  to  perform  the  same  task  as  Gauss-Seidel  iteration  while  using 
careful  bookkeeping  to  concentrate  all  computational  effort  on  the  most  “interesting”  parts  of  the 
system.  It  operates  in  a  similar  computational  regime  as  the  Dyna  architecture  (Sutton  1990),  in 
which a fixed, but non-trivial, amount of computation is allowed between each real-world observation.
Peng and Williams (1992) are exploring a closely related approach to prioritized sweeping,
developed  from  Dyna  and  Q-learning  (Watkins  1989). 

Prioritized sweeping uses the $\Delta$ value from the probability update step 2.2 in the previous
algorithm to determine which other updates are likely to be “interesting”: if the step produces a
large change in the state’s absorption probabilities then it is interesting because it is likely that
the absorption probabilities of the predecessors of the state will change given an opportunity. If,
on the other hand, the step produces a small change then we will assume that there is less urgency
to process the predecessors. The predecessors of a state $i$ are all those states $i'$ which have, at least
once in the history of the system, performed a one-step transition $i' \to i$.

If we have just changed the absorption probabilities of $i$ by $\Delta$, then the maximum possible
one-processing-step change in predecessor $i'$ caused by our change in $i$ is $\hat q_{i'i}\Delta$. This value is the
priority $P$ of the predecessor $i'$, and if $i'$ is not currently on the priority queue it is placed there at
priority $P$. If it is already on the queue, but at lower priority, then it is promoted.

After each real-world observation $i \to j$, the transition probability estimate $\hat q_{ij}$ is updated
along with the probabilities of transition to all other previously observed successors of $i$. Then
state $i$ is promoted to the top of the priority queue so that its absorption probabilities are updated
immediately. Next, we continue to process further states from the top of the queue. Each state
that is processed may result in the addition or promotion of its predecessors within the queue. This
loop continues for a preset number of processing steps or until the queue empties.

Thus  if  a  real  world  observation  is  interesting,  all  its  predecessors  and  their  earlier  ancestors 
quickly  find  themselves  near  the  top  of  the  priority  queue.  On  the  other  hand,  if  the  real  world 
observation  is  unsurprising,  then  the  processing  immediately  proceeds  to  other,  more  important 
areas  of  state-space  which  had  been  under  consideration  on  the  previous  time  step.  These  other 
areas  may  be  different  from  those  in  which  the  system  currently  finds  itself. 

Let us look at the formal algorithm in Figure 3. On entry we assume the most recent state
transition was from $i_{recent}$. We drop the $[t]$ suffix from the $\hat\pi_{ik}[t]$ notation.

The decision of when we are allowed further processing, at the start of Step 2, could be
implemented in many ways. In our subsequent experiments the rule is simply that a maximum of $\beta$
backups are permitted per real-world observation.

There  are  many  possible  priority  queue  implementations,  including  a  heap  (Knuth  1973),  which 
was  used  in  all  experiments  in  this  paper.  The  cost  of  the  algorithm  is 

$$
O\left(\beta \left( S_t\, \mu_{succs} + \mu_{preds}\, \text{PQCOST}(S_{nt}) \right)\right) \qquad (8)
$$

basic operations, where at most $\beta$ states are processed from the priority queue and PQCOST($N$) is
the cost of accessing a priority queue of length $N$. For the heap implementation this is $\log_2 N$.

States are only added to the queue if their priorities are above a tiny threshold $\epsilon$. This is a
value  close  to  the  machine  floating-point  precision.  Stopping  criteria  are  fraught  with  danger,  but 
in  this  paper  we  discuss  such  dangers  no  further  except  to  note  that  in  our  experiments  they  have 
caused  no  problem. 

Prioritized  sweeping  is  a  heuristic,  and  in  this  paper  no  formal  proof  of  convergence,  or  conver¬ 
gence  rate,  is  given.  We  expect  to  be  able  to  prove  convergence  using  techniques  from  asynchronous 
Dynamic  Programming  (Bertsekas  and  Tsitsiklis  1989)  and  variants  of  the  Temporal  Differencing 
analysis  of  Dayan  (1992).  Later,  this  paper  gives  some  empirical  experiments  in  which  convergence 
is  relatively  fast. 
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1.  Promote  state  ^recent  to  top  of  priority  queue. 

2.  While  we  are  allowed  further  processing  and  priority  queue  not  empty 

2.1  Remove  the  top  state  from  the  priority  queue.  Call  it  i 

2.2  Aniax  ==  0 

2.3  for  each  k  6  TERMS 

Pnew  “  Qik  “h  ^ 

j€succs{{)nNONTERMS 
^  ”  I  Pnew  ““  1 

'^ik  •”  Pnew 

Amax  :=  max(A  max?  A) 

2.4  for  each  €  preds(2) 

P  := 

If  P  >  €  (a  tiny  threshold)  and  if  (i'  is  not  on 
queue  or  P  exceeds  the  current  priority  of  i')  then 
promote  i'  to  new  priority  P. 

Figure  3:  The  prioritized  sweeping  algorithm. 
The memory requirements of learning the $S \times S$ transition probability matrix, where $S$ is
the number of states, may initially appear prohibitive, especially since we intend to operate with
more than 10,000 states. However, we need only allocate memory for the experiences the system
actually has, and for a wide class of physical systems there is not enough time in the lifetime of
the system to run out of memory.

Similarly, the average number of successors and predecessors of states in the estimated transition
matrix can be assumed $\ll S$. A simple justification is that few real problems are fully
connected, but a deeper reason is that for large $S$, even if the true transition probability
matrix is not sparse, there will never be time to gain enough experience for the estimated
transition matrix to not be sparse.
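The sparsity argument can be checked with a small simulation. The sketch below is an illustration only: the function name `estimated_model_size`, the uniformly random fully connected chain, and the seed are assumptions made here. It stores one entry per distinct observed transition, so the stored model can never hold more entries than there have been observations, however dense the true matrix is.

```python
import random
from collections import defaultdict


def estimated_model_size(num_states, num_observations, seed=0):
    """Count the non-zero entries of the estimated transition matrix."""
    rng = random.Random(seed)
    counts = defaultdict(dict)              # counts[i][j], allocated on demand
    state = 0
    for _ in range(num_observations):
        nxt = rng.randrange(num_states)     # fully connected: any successor is possible
        counts[state][nxt] = counts[state].get(nxt, 0) + 1
        state = nxt
    return sum(len(row) for row in counts.values())
```

With 10,000 states the dense matrix has 10^8 entries, whereas `estimated_model_size(10000, 50000)` is bounded by the 50,000 observations.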

3  A  Markov  Prediction  Experiment 

Consider  the  500  state  Markov  system  depicted  in  Figure  4,  which  is  a  more  complex  version  of 
the  problem  presented  in  Figure  1.  Appendix  A  gives  details  of  how  this  problem  was  randomly 
generated.  The  system  has  sixteen  terminal  states,  depicted  by  white  and  black  circles.  The 
prediction  problem  is  to  estimate,  for  every  non-terminal  state,  the  long-term  probability  that  it 
will  terminate  in  a  black,  rather  than  a  white,  circle.  The  data  available  to  the  learner  is  a  sequence 
of observed state transitions.

Temporal differencing, the classical method, and prioritized sweeping were all applied to this
problem. Each learner was shown the same sequence. TD used parameters $\lambda = 0.25$ and
$\alpha = 0.05$, which gave the best performance of a number of manually optimized experiments.
The classical method was required to compute up-to-date absorption probability estimates after
every real-world observation. Prioritized sweeping was allowed five backups per real experience
($\beta = 5$); it thus updated the $\hat{\pi}_{ik}$ estimates for the five highest priority states
between each real-world observation.
The  threshold  for  ignoring  tiny  changes,  e,  was  10“®.  Each  method  was  evaluated  at  a  number  of 
stages of learning, by stopping the real-time clock and computing the error between the estimated
white-absorption probabilities, which we denote by $\hat{\pi}_{i,\mathrm{WHITE}}$, and the true
values, which we denote by $\pi_{i,\mathrm{WHITE}}$. The following RMS error over all states was
recorded:

$$\sqrt{\frac{1}{S}\sum_{i=1}^{S}\left(\pi_{i,\mathrm{WHITE}} - \hat{\pi}_{i,\mathrm{WHITE}}\right)^{2}} \qquad (9)$$

Figure 4: A 500-state Markov system. Each state has, on average, 5 stochastic successors.

In  Figure  5  we  look  at  the  RMS  error  plotted  against  the  number  of  observations.  After  100,000 
experiences  all  methods  are  performing  well;  TD  is  the  weakest  but  even  it  manages  an  RMS  error 
of  only  0.1. 
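The RMS error of Equation (9) is straightforward to compute once the true and estimated absorption probabilities are available as two equal-length sequences. The helper below is a minimal sketch; the function name and argument names are illustrative.

```python
import math


def rms_error(true_probs, estimated_probs):
    """Root-mean-square difference between true and estimated probabilities."""
    if len(true_probs) != len(estimated_probs) or not true_probs:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((t - e) ** 2 for t, e in zip(true_probs, estimated_probs))
    return math.sqrt(squared / len(true_probs))
```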

In  Figure  6  we  look  at  a  different  measure  of  performance:  plotted  against  real  time.  Here  we 
see the great weakness of the classical technique. Performing the Gauss-Seidel algorithm of Figure 2
after  each  observation  gives  excellent  predictions  but  is  very  time  consuming,  and  after  300  seconds 
there  has  only  been  time  to  process  a  few  thousand  observations.  After  the  same  amount  of  time, 
TD  has  had  time  to  process  almost  half  a  million  observations.  Prioritized  sweeping  performs  best 
relative  to  real  time.  It  takes  approximately  ten  times  as  long  as  TD  to  process  each  observation 
but  because  the  data  is  used  more  effectively,  convergence  is  superior. 

Ten further experiments, each with a different random 500-state problem, were run. These
further runs, the final results of which are given in Table 1, indicate that the graphs in
Figures 5 and 6 are not atypical.

                              TD               Classical         Pri. Sweep
After 100,000 observations    0.14 ± 0.077     0.024 ± 0.0063    0.024 ± 0.0061
After 300 seconds             0.079 ± 0.067    0.23 ± 0.038      0.021 ± 0.0080

Table 1: RMS prediction error: mean and standard deviation for ten experiments.

This example has shown the general theme of this paper. Model-free methods perform well in
real time but make weak use of their data. Classical methods make good use of their data but are
often impractically slow. Techniques such as prioritized sweeping are interesting because they may
be able to achieve both.

There  is  an  important  footnote  concerning  the  classical  method.  If  the  problem  had  only 
required  that  a  prediction  be  made  after  all  transitions  had  been  observed,  then  the  only  real  time 
cost  would  have  been  recording  the  transitions  in  memory.  The  absorption  probabilities  could  then 
have  been  computed  as  an  individual  large  computation  at  the  end  of  the  sequence,  giving  the  best 
possible  estimate  with  a  relatively  small  overall  time  cost.  For  the  500-state  problem,  we  estimate 
the  cost  as  approximately  30  seconds  for  100,000  points.  Prioritized  sweeping  could  also  benefit 
from only being required to predict after seeing all the data, although with little advantage over
the simpler, classical algorithm. Prioritized sweeping is thus most usefully applicable to the
class of tasks in which a prediction is required on every time step. Furthermore, the remainder of
the paper concerns control of Markov decision tasks, in which the maintenance of up-to-date
predictions is particularly beneficial.

4  Learning  Control  of  Markov  Decision  Tasks 

Let us consider a related stochastic prediction problem, which bridges the gap between Markov
prediction and control. Suppose the system gets rewarded for entering certain states and punished
for entering others. Let the reward of the $i$th state be $r_i$. An important quantity is then the
expected discounted reward-to-go of each state. This is an infinite sum of expected future rewards,
with each term supplemented by an exponentially decreasing weighting factor $\gamma^k$, where
$\gamma$ is called the discount factor. The expected discounted reward-to-go is

$$\begin{aligned}
J_i = {} & (\text{This reward}) \; + \\
         & \gamma \, (\text{Expected reward in 1 time step}) \; + \\
         & \gamma^2 \, (\text{Expected reward in 2 time steps}) \; + \\
         & \cdots \; + \\
         & \gamma^k \, (\text{Expected reward in } k \text{ time steps}) \; + \cdots
\end{aligned} \qquad (10)$$

For each $i$, $J_i$ can be computed recursively as a function of its immediate successors.

$$\begin{aligned}
J_1 &= r_1 + \gamma \left( q_{11} J_1 + q_{12} J_2 + \cdots + q_{1S} J_S \right) \\
J_2 &= r_2 + \gamma \left( q_{21} J_1 + q_{22} J_2 + \cdots + q_{2S} J_S \right) \\
    &\;\;\vdots \\
J_S &= r_S + \gamma \left( q_{S1} J_1 + q_{S2} J_2 + \cdots + q_{SS} J_S \right)
\end{aligned} \qquad (11)$$

which is another set of linear equations that may be solved if the transition probabilities
$q_{ij}$ are known. If they are not known, but instead a sequence of state transitions and $r_i$
observations is given, then slight modifications of TD, the classical algorithm, and prioritized
sweeping can all be used to estimate $J_i$.
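When the transition probabilities are known, the linear system of Equation (11) can be solved by repeatedly sweeping through the states and applying each equation in turn. The sketch below uses this Gauss-Seidel style iteration rather than a direct matrix solve; the function name, the sparse dictionary representation of the transition probabilities, and the stopping tolerance are assumptions made for illustration.

```python
def reward_to_go(r, q, gamma, max_sweeps=10000, tol=1e-12):
    """Iteratively solve J_i = r_i + gamma * sum_j q_ij * J_j.

    r[i] is the reward of state i; q[i] is a dict {j: q_ij} of successors.
    """
    J = [0.0] * len(r)
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for i in range(len(r)):
            new_value = r[i] + gamma * sum(p * J[j] for j, p in q[i].items())
            biggest_change = max(biggest_change, abs(new_value - J[i]))
            J[i] = new_value          # updated in place, so later states see it
        if biggest_change < tol:
            break
    return J
```

For example, with two states where state 0 always moves to state 1, state 1 always stays put, `r = [0.0, 1.0]` and `gamma = 0.5`, the iteration approaches $J_1 = 2$ and $J_0 = 1$.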



Markov  decision  tasks 

Markov decision tasks are an extension of the Markov model in which, instead of passively watching
the state move around randomly, we are able to influence it.

Associated with each state, $i$, is a finite, discrete set of actions, actions($i$). On each time
step, the controller must choose an action. The probabilities of potential next states depend not
only on the current state, but also on the chosen action. We will supplement our example problem
with actions:

actions(1) = {RANDOM, RIGHT}   actions(3) = {RANDOM, RIGHT}   actions(5) = {STAY}
actions(2) = {RANDOM, RIGHT}   actions(4) = {RANDOM, RIGHT}   actions(6) = {STAY}      (12)
where RANDOM causes the same random transitions as before, RIGHT moves, with probability 1, to the
cell immediately to the right, and STAY makes us remain in the same state. There is still no escape
from states 5 and 6.

We use the notation $q^a_{ij}$ for the probability that we move to state $j$, given that we have
commenced in state $i$ and applied action $a$. Thus, in our example | and = 1.

A policy is a mapping from states to actions. For example, Figure 7 shows the policy

1 → RIGHT     3 → RIGHT     5 → STAY
2 → RANDOM    4 → RANDOM    6 → STAY      (13)

If the controller chooses actions according to a fixed policy then it behaves like a Markov system.
The expected discounted reward-to-go can then be defined and computed in the same manner as
Equation (11).


Figure 7: The policy defined by Equation (13). Also shown is a reward function (bottom left of each
cell). Large expected reward-to-go involves getting to ‘5’ and avoiding ‘6’.


If  the  goal  is  large  reward-to-go,  then  some  policies  are  better  than  others.  An  important  result 
from  the  theory  of  Markov  decision  tasks  tells  us  that  there  always  exists  at  least  one  policy  which 
is  optimal  in  the  following  sense.  For  every  state,  the  expected  discounted  reward-to-go  using  an 
optimal  policy  is  no  worse  than  that  from  any  other  policy. 

Furthermore, there is a simple algorithm for computing both an optimal policy and the expected
discounted reward-to-go of this policy. The algorithm is called Dynamic Programming (Bellman 1957).
It is based on the following relationship, known as Bellman’s optimality equation, which holds
between the optimal expected discounted reward-to-go at different states.

$$J_i = \max_{a \in \mathrm{actions}(i)} \left( r_i + \gamma \left( q^a_{i1} J_1 + q^a_{i2} J_2 + \cdots + q^a_{iS} J_S \right) \right) \qquad (14)$$


Dynamic  programming  applied  to  our  example  gives  the  policy  shown  in  Figure  7,  which  happens 
to  be  the  unique  optimal  policy. 
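One common way to apply Bellman’s optimality equation (14) is value iteration: sweep through the states, replace each $J_i$ by the right-hand side of the equation, and stop when the values no longer change. The sketch below is an illustration under assumptions: the function name, the nested-dictionary model, and the tolerance are choices made here, and the three-state task in the usage note is hypothetical rather than the six-state example of the text.

```python
def value_iteration(r, q, gamma, tol=1e-10, max_sweeps=100000):
    """Return (J, policy) for a known Markov decision task.

    r[i]    : reward of state i
    q[i][a] : dict {j: probability of moving to j after action a in state i}
    """
    def backed_up(i, a, J):
        return r[i] + gamma * sum(p * J[j] for j, p in q[i][a].items())

    J = {i: 0.0 for i in r}
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for i in r:
            best = max(backed_up(i, a, J) for a in q[i])
            biggest_change = max(biggest_change, abs(best - J[i]))
            J[i] = best
        if biggest_change < tol:
            break
    # The greedy policy picks, in each state, an action achieving the maximum.
    policy = {i: max(q[i], key=lambda a, i=i: backed_up(i, a, J)) for i in r}
    return J, policy
```

As a usage example, take a hypothetical task in which state 0 offers SAFE (always to state 1) and RISKY (to state 1 or 2 with equal probability), states 1 and 2 are absorbing with rewards 10 and -10, and $\gamma = 0.9$. The computed policy chooses SAFE in state 0.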

A very important question for machine learning has been how to obtain an optimal, or near
optimal, policy when the $q^a_{ij}$ values are not known in advance. Instead, a series of actions,
state transitions, and rewards is observed. For example:


l(ri  =  0) 
2(r2  =  0) 
3(r3  =  0) 


RANDOM 

RANDOM 


RANDOM 


5(r5  =  10) 


6(r6  =  10) 

5(r5  =  -10) 


(15) 


A  critical  difference  between  this  problem  and  the  Markov  prediction  problem  of  the  earlier  sections 
is  that  the  controller  now  affects  which  transitions  are  seen,  because  it  supplies  the  actions. 

The question of learning such systems is studied by the field of reinforcement learning, which is
also known as “learning control of Markov decision tasks”. Early contributions to this field were
the checkers player of Samuel (1959) and the BOXES system of Michie and Chambers (1968). Even
systems which may at first appear trivially small, such as the two-armed bandit problem (Berry
and Fristedt 1985), have promoted rich and interesting work in the statistics community.

The technique of gradient descent optimization of neural networks in combination with
approximations to the policy and reward-to-go (called the “adaptive heuristic critic”) was
introduced by Sutton (1984). Kaelbling (1990) introduced several applicable techniques, including
the Interval Estimation algorithm. Watkins (1989) introduced an important model-free asynchronous
Dynamic Programming technique called Q-learning. Sutton (1990) has extended this further with
the Dyna architecture. Christiansen et al. (1990) applied a planner, closely related to Dynamic
Programming, to a tray-tilting robot. An excellent review of the entire field may be found
in (Barto et al. 1991).
4.1  Prioritized sweeping for learning control of Markov decision tasks

The  main  differences  between  this  case  and  the  previous  application  of  prioritized  sweeping  are 


1.  We  need  to  estimate  the  optimal  discounted  reward-to-go,  J,  of  each  state,  rather  than  the 
eventual  absorption  probabilities. 

2. Instead of using the absorption probability backup Equation (6), we use Bellman’s
equation (Bellman 1957; Bertsekas and Tsitsiklis 1989):

$$\hat{J}_i = \max_{a \in \mathrm{actions}(i)} \left( \hat{r}^a_i + \gamma \times \sum_{j \in \mathrm{succs}(i,a)} \hat{q}^a_{ij} \hat{J}_j \right) \qquad (16)$$

where $\hat{J}_i$ is the estimate of the optimal discounted reward starting from state $i$,
$\gamma$ is the discount factor, actions($i$) is the set of possible actions in state $i$, and
$\hat{q}^a_{ij}$ is the maximum likelihood estimated probability of moving from state $i$ to
state $j$ given that we have applied action $a$. The estimated immediate reward, $\hat{r}^a_i$,
is computed as the mean reward experienced to date during all previous applications of action $a$
in state $i$.

3.  The  rate  of  learning  can  be  affected  considerably  by  the  controller’s  exploration  strategy. 


The algorithm for prioritized sweeping in conjunction with Bellman’s equation is given in
Figure 8. The only substantial difference between this algorithm and the prediction case is the
state backup step, namely the application of Bellman’s equation in Step 2.2. Notice also that the
predecessors of a state are now a set of state-action pairs.

Let  us  now  consider  the  question  of  how  best  to  gain  useful  experience  in  a  Markov  decision 
task.  The  formally  correct  method  would  be  to  compute  that  exploration  which  maximizes  the 
expected  reward  received  over  the  robot’s  remaining  life.  This  computation,  which  requires  a 
prior  probability  distribution  over  the  space  of  Markov  decision  tasks,  is  unrealistically  expensive. 
It is computationally exponential in all of (i) the number of time steps for which the system is
to remain alive, (ii) the number of states in the system, and (iii) the number of actions
available (Berry and Fristedt 1985).

An exploration heuristic is thus required. Kaelbling (1990) and Barto et al. (1991) both give
excellent overviews of the wide range of heuristics which have been proposed.
1. Promote state $i_{\mathrm{recent}}$ to top of priority queue.

2. While we are allowed further processing and priority queue not empty

   2.1 Remove the top state from the priority queue. Call it $i$

   2.2 $P_{\mathrm{new}} := \max_{a \in \mathrm{actions}(i)} \left( \hat{r}^a_i + \gamma \times \sum_{j \in \mathrm{succs}(i,a)} \hat{q}^a_{ij} \hat{J}_j \right)$

   2.3 $\Delta_{\max} := |P_{\mathrm{new}} - \hat{J}_i|$

   2.4 $\hat{J}_i := P_{\mathrm{new}}$

   2.5 for each $(i', a') \in \mathrm{preds}(i)$

       $P := \hat{q}^{a'}_{i'i}\,\Delta_{\max}$

       If $P > \epsilon$ (a tiny threshold) and if ($i'$ is not on
       queue or $P$ exceeds the current priority of $i'$) then
       promote $i'$ to new priority $P$.

Figure 8: The prioritized sweeping algorithm for Markov Decision Tasks.
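The steps of Figure 8 can be sketched as a single function that is called after every real observation. This is an illustration under assumptions, not the authors' code: the function name `sweep`, the dictionary layout of the estimated model, and the lazy-deletion heap are choices made here, and the estimates `r_hat` and `q_hat` are assumed to have been updated from the observation before the call. States are assumed to be mutually comparable (for example integers).

```python
import heapq


def sweep(i_recent, J, r_hat, q_hat, preds, priority, heap, gamma, beta, eps):
    """Run up to `beta` Bellman backups, highest priority first.

    r_hat[i][a]    : mean reward observed for action a in state i
    q_hat[i][a]    : dict {j: estimated probability of reaching j}
    preds[i]       : set of (state, action) pairs observed to lead to i
    priority, heap : the persistent priority queue (dict plus heapq list)
    """
    priority[i_recent] = float("inf")                     # step 1
    heapq.heappush(heap, (-priority[i_recent], i_recent))
    backups = 0
    while heap and backups < beta:                        # step 2
        neg_p, i = heapq.heappop(heap)                    # step 2.1
        if priority.get(i) != -neg_p:
            continue                                      # stale heap entry
        del priority[i]
        backups += 1
        p_new = max(                                      # step 2.2
            r_hat[i][a] + gamma * sum(
                p * J.get(j, 0.0) for j, p in q_hat[i][a].items())
            for a in q_hat[i])
        delta_max = abs(p_new - J.get(i, 0.0))            # step 2.3
        J[i] = p_new                                      # step 2.4
        for prev, act in preds.get(i, ()):                # step 2.5
            p = q_hat[prev][act].get(i, 0.0) * delta_max
            if p > eps and p > priority.get(prev, 0.0):
                priority[prev] = p
                heapq.heappush(heap, (-p, prev))
    return backups
```

Because a state is only re-queued when a successor's value changed by more than `eps`, the loop frequently ends before `beta` backups have been used.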
We use the philosophy of optimism in the face of uncertainty, a method successfully developed by
the Interval Estimation (IE) algorithm of Kaelbling (1990) and by the exploration bonus technique
in Dyna (Sutton 1990). The same philosophy is also used by Thrun and Moller (1992).

A  slightly  different  heuristic  is  used  with  the  prioritized  sweeping  algorithm.  This  is  because 
of  minor  problems  of  computational  expense  for  IE  and  the  instability  of  the  exploration  bonus  in 
large  state-spaces. 

The slightly different optimistic heuristic is as follows. In the absence of contrary evidence,
any action in any state is assumed to lead us directly to a fictional absorbing state of permanent
large reward $r^{\mathrm{opt}}$. The amount of evidence to the contrary which is needed to quench
our optimism is a system parameter, $T_{\mathrm{bored}}$. If the number of occurrences of a given
state-action pair is less than $T_{\mathrm{bored}}$, we assume that we will jump to the fictional
state with subsequent long-term reward
$r^{\mathrm{opt}} + \gamma r^{\mathrm{opt}} + \gamma^2 r^{\mathrm{opt}} + \cdots = r^{\mathrm{opt}}/(1-\gamma)$.
If the number of occurrences is not less than $T_{\mathrm{bored}}$, then we use the true,
non-optimistic, assumption. Thus the optimistic reward-to-go estimate is

$$\hat{J}_i = \max_{a \in \mathrm{actions}(i)}
\begin{cases}
r^{\mathrm{opt}}/(1-\gamma) & \text{if } n^a_i < T_{\mathrm{bored}} \\
\hat{r}^a_i + \gamma \times \sum_{j \in \mathrm{succs}(i,a)} \hat{q}^a_{ij} \hat{J}_j & \text{otherwise}
\end{cases} \qquad (17)$$

where $n^a_i$ is the number of times action $a$ has been tried to date in state $i$. The important
feature, identified by Sutton (1990), is the planning to explore behavior caused by the appearance
of the optimism on both sides of the equation. A related exploration technique was used by
Christiansen et al. (1990). Consider the situation in Figure 9. The top left hand corner of
state-space only looks attractive if we use an optimistic heuristic. The areas near the frontiers
of little experience will have high $\hat{J}$, and in turn the areas near those have nearly as
high $\hat{J}$. Therefore, if prioritized sweeping (or any other asynchronous dynamic programming
method) does its job, from START we will be encouraged to go north towards the unknown instead of
east to the best reward discovered to date.
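The right-hand side of Equation (17) can be sketched as a drop-in replacement for the ordinary Bellman backup. The function and argument names below are illustrative assumptions. One further assumption is made explicit in the code: a successor state that has never been backed up is given the optimistic value, since all of its actions are still untried.

```python
def optimistic_backup(i, J, r_hat, q_hat, n, actions, gamma, r_opt, t_bored):
    """Optimistic reward-to-go estimate of state i.

    n[(i, a)]     : number of times action a has been tried in state i
    r_hat[(i, a)] : mean observed reward for that state-action pair
    q_hat[(i, a)] : dict {j: estimated transition probability}
    """
    optimistic_value = r_opt / (1.0 - gamma)    # permanent reward r_opt, discounted
    values = []
    for a in actions[i]:
        if n.get((i, a), 0) < t_bored:
            values.append(optimistic_value)     # not enough evidence yet: stay optimistic
        else:
            values.append(r_hat[(i, a)] + gamma * sum(
                # Assumption: states never backed up default to the optimistic value.
                p * J.get(j, optimistic_value)
                for j, p in q_hat[(i, a)].items()))
    return max(values)
```

Because the optimistic value also enters through the successor estimates, states several steps away from an under-sampled region inherit a slightly discounted share of its attraction, which is what produces the planning to explore behavior.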


The system parameter $r^{\mathrm{opt}}$ does not require fine tuning. It can be set to a gross
overestimate of the largest possible reward, and the system will simply continue exploration until
it has sampled all state-action combinations $T_{\mathrm{bored}}$ times. However, Section 6
discusses its use as a search-guiding heuristic similar to the heuristic at the heart of A* search.
Figure 9: The state-space of a very simple path planning problem.


The $T_{\mathrm{bored}}$ parameter, which defines how often we must try a given state-action
combination before we cease our optimism, certainly does require forethought by the human
programmer. If too small, we might overlook some low probability but highly rewarding stochastic
successor. If too high, the system will waste time needlessly resampling already reliable
statistics. Thus, the exploration procedure does not have full autonomy. This is, arguably, a
necessary weakness of any non-random exploration heuristic. Dyna’s exploration bonus contains a
similar parameter in the relative size of the exploration bonus to the expected reward, and
Interval Estimation has the parameter implicit in the optimistic confidence level.

The  selection  of  an  appropriate  Tbored  would  be  hard  to  formalize.  It  should  take  into  account: 
the  expected  lifetime  of  the  system,  a  measure  of  the  importance  of  not  becoming  stuck  during 
learning,  and  perhaps  any  available  prior  knowledge  of  the  stochasticity  of  the  system,  or  known 
constraints  on  the  reward  function.  An  automatic  procedure  for  computing  Tbored  would  require 
a  formal  definition  of  the  human  programmer’s  requirements  and  a  prior  distribution  of  possible 
worlds. 
5  Experimental Results


This section begins with some comparative results in the familiar domain of stochastic
two-dimensional maze worlds. It then examines the $\beta$ parameter, which specifies the amount of
computation (number of Bellman equation backups) allowed per real-world observation, and also the
$T_{\mathrm{bored}}$ parameter, which defines how much exploration is performed. A number of larger
examples are then used to investigate performance for a range of different discrete stochastic
reinforcement tasks.

Maze  problems 

Each  state  has  four  actions:  one  for  each  direction.  Blocked  actions  do  not  move.  One  goal  state 
(the  star  in  subsequent  figures)  gives  100  units  of  reward,  all  others  give  no  reward,  and  there  is 
a discount factor of 0.99. Trials start in the bottom left corner. The system is reset to the start
state whenever the goal state has been visited ten times since the last reset. The reset is outside
the learning task: it is not observed as a state transition.

Dyna and prioritized sweeping were both allowed ten Bellman’s equation backups per observation
($\beta = 10$). Two versions of Dyna were tested:

1. Dyna-PI+ is the original Dyna-PI of Sutton (1990), supplemented with the exploration bonus
($\epsilon = 0.001$) from the same paper.

2. Dyna-opt is the original Dyna-PI supplemented with the same $T_{\mathrm{bored}}$ optimistic
heuristic that is used by prioritized sweeping.

Table 2 shows the number of observations before convergence. A trial was defined to have converged
by a given time if no subsequent sequence of 1000 decisions contained more than 2% suboptimal
decisions. The test for optimality was performed by comparison with the control law obtained from
full dynamic programming using the true simulation.
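The convergence criterion can be made concrete with a sliding window over a recorded run. The sketch below is one possible reading, stated as an assumption: given a list marking each decision as optimal or not, it returns the earliest time after which every complete window of `window` decisions contains at most the allowed fraction of suboptimal decisions, or `None` if the record ends before that can be confirmed. The function name and its parameters are illustrative.

```python
def convergence_time(optimal, window=1000, max_bad_fraction=0.02):
    """optimal[t] is True when decision t agreed with the optimal control law."""
    n = len(optimal)
    if n < window:
        return None                       # not even one complete window
    limit = max_bad_fraction * window
    # Number of suboptimal decisions in the window starting at time 0.
    bad = sum(1 for ok in optimal[:window] if not ok)
    last_bad_start = 0 if bad > limit else -1
    for start in range(1, n - window + 1):
        # Slide the window: one decision enters on the right, one leaves on the left.
        bad += (not optimal[start + window - 1]) - (not optimal[start - 1])
        if bad > limit:
            last_bad_start = start
    t = last_bad_start + 1
    return t if t <= n - window else None
```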

We begin with some results for deterministic problems, in the first three rows of Table 2. The
first row shows that Dyna-PI+ converged for all problems except the 4,528 state problem. A smaller
exploration bonus than $\epsilon = 0.001$ might have helped the latter problem converge, albeit
slowly.
Maze            15 state   117 state   178 state   284 state   605 state       2627       4528
Det Dyna-PI+         400         500      10,000      18,000      36,000    195,000     > 10^6
Det Dyna-opt         300         900       4,250      12,000      21,000    105,000    245,000
Det PriSweep         150       1,200       3,250       2,800       6,000     29,000     59,000
Stc Q                600      31,000      62,000     310,000    untested   untested   untested
Stc Q-opt            500      > 10^6      > 10^6    untested    untested   untested   untested
Stc Dyna-PI+         400       4,750      12,000      25,000      58,000    240,000    525,000
Stc Dyna-opt         700       5,250       7,500      14,000      35,000    155,000    310,000
Stc PriSweep         600       3,500       5,500      11,000      22,000     94,000    200,000

Table 2. Number of observations before 98% of decisions were subsequently optimal. These values
have been rounded. For prioritized sweeping (and Dyna, where applicable) $\beta = 10$,
$\epsilon$ = 10”^ and $r^{\mathrm{opt}} = 200$. The tabulated experiments were all only run once;
however, further multiple runs of the optimistic Dyna and prioritized sweeping have revealed little
variance in convergence rate. See also Figures 11 and 14.


The other two rows used the optimistic heuristic with $r^{\mathrm{opt}} = 200$ and
$T_{\mathrm{bored}} = 1$. The value $r^{\mathrm{opt}}$ thus overestimated the best possible reward
by a factor of two; this was to see if we would converge without an accurate estimation of the true
best possible reward. $T_{\mathrm{bored}} = 1$ meant that as soon as something was tried all
optimism was lost. This is a safe strategy in a deterministic environment.

The learning controller was given no clues beyond those implicit in the two parameters
$r^{\mathrm{opt}}$ and $T_{\mathrm{bored}}$. Thus, to ensure convergence to the optimum, it had to
sample each state-action pair at least once.

Prioritized sweeping required fewer steps than optimistic Dyna in all mazes but one small one.
All learners and runs took between 10 and 30 seconds per thousand observations running on a Sun-4
workstation. Interestingly, prioritized sweeping usually took about half the real time of Dyna.
This is because during much of the exploration there were so few surprises that it did not need to
use its full allocation of Bellman’s equation processing steps. This effect is even more pronounced
if 300 processing steps per observation are allowed instead of ten. For example, in the 4,528 state
problem, optimistic Dyna then required 143,000 observations and took three hours. Prioritized
sweeping required 21,000 observations and took fifteen minutes.

The lower part of Table 2 shows the results for stochastic problems using the same mazes. Each
action had a 50% chance of being corrupted to a random value before it was applied. Thus if
“North” was applied the outcome was movement North $\frac{1}{2} + \frac{1}{8} = \frac{5}{8}$ of the
time, and each other direction $\frac{1}{8}$ of the time. Prioritized sweeping and optimistic Dyna
each used a $T_{\mathrm{bored}}$ value of 5. Thus, they sampled every state-action combination five
times before losing their optimism. This value was chosen as a reasonable balance between
exploration and exploitation, given the authors’ knowledge of the stochasticity of the system, and
happily it proved to be satisfactory. As we discussed in Section 4.1, the choice of
$T_{\mathrm{bored}}$ is not automated for any of these experiments.
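The outcome probabilities follow from the corruption rule: with probability 1/2 the intended action is applied, and otherwise one of the four actions is applied uniformly at random, which may again be the intended one. A short sketch with exact fractions, in which the function name and the action labels are illustrative assumptions:

```python
from fractions import Fraction


def outcome_distribution(intended,
                         actions=("North", "East", "South", "West"),
                         corruption=Fraction(1, 2)):
    """Probability of each action actually being applied."""
    uniform = Fraction(1, len(actions))
    return {
        a: corruption * uniform + ((1 - corruption) if a == intended else 0)
        for a in actions
    }
```

For `outcome_distribution("North")` this gives 5/8 for North and 1/8 for each other direction.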

These stochastic results also include a recent interesting incremental technique called Q-learning
(Watkins 1989), which manages to learn without constructing a state transition model.
Additionally, we tried Q-learning using the same $T_{\mathrm{bored}}$ optimistic heuristic as
prioritized sweeping. The initial Q values were set high to encourage better initial exploration
than a random walk. Much effort was put into tuning Q for this application. Its performance was,
however, worse. In particular, the optimistic heuristic is a disaster for Q-learning, which easily
gets trapped. This is because Q-learning only pays attention to the current state of the system,
while the “planning to explore” behavior requires that attention is paid to areas of the
state-space which the system is not currently in.

For  the  stochastic  maze  results  the  difference  between  optimistic  Dyna  and  prioritized  sweeping 
is less pronounced. This is because the large number of predecessors quickly dilutes the wave of
interesting  changes  which  are  propagated  back  on  the  priority  queue,  leading  to  a  queue  of  many, 
very  similar,  priorities.  However,  prioritized  sweeping  still  required  less  than  half  the  total  real 
time  of  either  version  of  Dyna  before  convergence. 

A  small,  fully  connected,  example 

We also have results for a five-state benchmark problem described by Sato et al. (1988) and
also used in Barto and Singh (1990). The transition matrix is in Figure 10 and the results are
shown in Table 3. A $T_{\mathrm{bored}}$ parameter of 20 was used. In fact,
$T_{\mathrm{bored}} = 5$ also converged 20 times out of 20, taking on average 120 steps, and
therefore $T_{\mathrm{bored}} = 20$ was considered a safe margin. The two Q-learners were heavily
tweaked to find their best performance. The
EQ-algorithm  (Barto  and  Singh  1990)  is  designed  to  guarantee  convergence  at  all  costs — and  so  its 
poor  comparative  performance  here  is  to  be  expected.  Dyna-PI+  was  given  what  was  probably  too 
small  an  exploration  bonus  for  the  problem.  The  reduced  exploration  meant  faster  convergence, 
but  on  one  occasion  some  misleading  early  transitions  caused  it  to  get  stuck  with  a  suboptimal 
policy. 

The  system  parameters  for  prioritized  sweeping 

We now look at two results to give insight into two important parameters of prioritized sweeping.
Firstly we consider its performance relative to the number of backups per observation. This
experiment used the stochastic, 605 state example from Table 2 and the results are graphed in
Figure 11. Using one operation is almost equivalent to optimistic Q-learning, which does not
converge. Even using only two backups gives reasonable performance, and performance improves as the
number of backups increases. Beyond fifty backups, the priority queue usually gets exhausted on
each time step, and there is little further improvement.

Table 3. The mean number of observations before > 98% of subsequent decisions were optimal.
Each learner was run twenty times and in all cases, bar one, there was eventual convergence to
optimal performance. Also shown is the standard deviation of the twenty trials. The discount
factor was $\gamma = 0.8$. For the optimistic methods $r^{\mathrm{opt}} = 10$ and
$T_{\mathrm{bored}} = 20$. For prioritized sweeping and Dyna $\beta = 10$, and for prioritized
sweeping $\epsilon$ = 10”^.

Figure 10: [...] ($\times 100$) and expected rewards of a five state, three action, Markov control
problem.

Figure 11: [...] needed for prioritized sweeping to converge, plotted against number of backups per
observation ($\beta$). This used the 605 state stochastic maze from Table 2 ($\gamma = 0.99$,
$r^{\mathrm{opt}} = 200$, $T_{\mathrm{bored}} = 5$, $\epsilon$ = 10*“^). The error bars show the
standard deviations from ten runs with different random seeds.

The other parameter is $T_{\mathrm{bored}}$. We use a test case in which inadequate exploration is
particularly dangerous. The maze in Figure 12 has two reward states. The lesser reward of 50 comes
from the state in the bottom right. The greater reward of 100 is from the more inaccessible state
near the top right. Trials always begin from the bottom left and the world is stochastic in the
same manner as the earlier examples. Trials are reset when either goal state is encountered ten
times. If $T_{\mathrm{bored}}$ is set too low and if there is bad luck while attempting to explore
near the large reward state then the controller will lose interest, never return, and very likely
spend the rest of its days traveling to the inferior reward. Each value of $T_{\mathrm{bored}}$ was
run ten times and we recorded the percentage of runs which had converged correctly by 50,000
observations. Figure 13 graphs the results. For this problem $T_{\mathrm{bored}} = 5$ (which was
checked a further 30 times) appears sufficient to ensure that we do not become stuck.


Figure 12: A misleading maze. A small reward in the bottom right tempts us away from a larger
reward.


Figure 14 shows the number of experiences needed for convergence as a function of
$T_{\mathrm{bored}}$ for the same set of experiments.

Other  tasks 

We  begin  with  a  task  with  a  3-d  state-space  quantized  into  14,400  potential  discrete  states i  guiding 
a  rod  through  a  planar  maze  by  translation  and  rotation.  There  are  four  actions:  move  forwards 
one  unit  along  the  rod’s  length,  move  backwards  one  unit,  rotate  left  one  unit  and  rotate  right  one 


25 


Figure  13:  The  frequency  of  correct  con¬ 
vergence  versus  T^or^  mislezui- 

ing  mdize  (7  =  0.99,  =  200,  ^  =  10, 

e=:10~^). 


Figure  14:  The  mean  and  standard  de¬ 
viation  number  of  experiences  before 
convergence  for  ten  independent  exper¬ 
iments,  as  a  function  of  for  the 

misleading  maze.  Parameter  values  are 
as  in  Figure  13. 


unit. In fact, the action takes us to the nearest quantized state after having applied the action.
There are 20 × 20 position quantizations and 36 angle quantizations producing 14,400 states, though
many are unreachable from the start. The distance unit is 1/20th the width of the workspace and
the angular unit is 10 degrees. The problem is deterministic but requires a long, very specific,
sequence of moves to get to the goal. Figure 15 shows the problem, obstacles and shortest solution
for our experiments.
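
The quantization just described can be sketched as follows. This is only an illustration under stated assumptions: the workspace is taken to be a unit square, "forwards" is taken to mean motion along the rod's current angle, obstacles are ignored, and the names `quantize`, `state_index` and `step` are hypothetical rather than taken from the original implementation.

```python
import math

N_POS = 20                       # position cells per axis (from the text)
N_ANG = 36                       # angle cells of 10 degrees each (from the text)
UNIT = 1.0 / N_POS               # distance unit, assuming a workspace of width 1
ANG_UNIT = 2 * math.pi / N_ANG   # angular unit: 10 degrees in radians

def quantize(x, y, theta):
    """Map a continuous pose to the nearest quantized cell (ix, iy, ia)."""
    ix = min(N_POS - 1, max(0, int(x * N_POS)))
    iy = min(N_POS - 1, max(0, int(y * N_POS)))
    ia = int(round(theta / ANG_UNIT)) % N_ANG
    return ix, iy, ia

def state_index(ix, iy, ia):
    """Flatten a cell triple into a single index in [0, 14400)."""
    return (ix * N_POS + iy) * N_ANG + ia

def step(ix, iy, ia, action):
    """Apply one of the four actions from a cell centre, then re-quantize.
    Obstacle checking is deliberately omitted in this sketch."""
    x, y, theta = (ix + 0.5) * UNIT, (iy + 0.5) * UNIT, ia * ANG_UNIT
    if action == "forward":
        x, y = x + UNIT * math.cos(theta), y + UNIT * math.sin(theta)
    elif action == "backward":
        x, y = x - UNIT * math.cos(theta), y - UNIT * math.sin(theta)
    elif action == "left":
        theta += ANG_UNIT
    elif action == "right":
        theta -= ANG_UNIT
    return quantize(x, y, theta)
```

Because every action is followed by re-quantization, a diagonal move of one unit may or may not change both position indices, which is one reason the shortest solution is a long and very specific sequence.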

Q, Dyna-PI+, optimistic Dyna and prioritized sweeping were all tested. The results are in
Table 4.

Q and Dyna-PI+ did not even travel a quarter of the way to the goal, let alone discover an
optimal path, within 200,000 experiences. It is possible that a very well-chosen exploration bonus
would have helped Dyna-PI+, but in the four different experiments we tried, no value produced
stable exploration.

Optimistic  Dyna  and  prioritized  sweeping  both  eventually  converged,  with  the  latter  requiring 
a  third  the  experiences  and  a  fifth  the  real  time. 

When  2000  backups  per  experience  were  permitted,  instead  of  100,  then  both  optimistic  Dyna 
Figure 15: A three-DOF problem and the shortest solution path.


                        Experiences to converge    Real time to converge
Q                       never
Dyna-PI+                never
Optimistic Dyna         55,000                     1500 secs
Prioritized Sweeping    14,000                     330 secs

Table 4. Performance on the deterministic rod-in-maze task. Both Dynas and prioritized sweeping
were allowed 100 backups per experience ($\gamma = 0.99$, $r^{opt} = 200$, $\beta = 100$, $T_{bored} = 1$, $\epsilon = 10^{-3}$).
and prioritized sweeping required fewer experiences to converge. Optimistic Dyna took 21,000
experiences instead of 55,000 but took 2,900 seconds, almost twice the real time. Prioritized
sweeping took 13,500 instead of 14,000 experiences (very little improvement), but it used no extra
time. This indicates that for prioritized sweeping, 100 backups per observation is sufficient to make
almost complete use of its observations, so that all the long term reward ($\hat{J}_i$) estimates are very
close to the estimates which would be globally consistent with the transition probability estimates
($\hat{q}_{ij}$). Thus, we conjecture that even full dynamic programming after each experience (which would
take days of real time) would do little better.

We  also  consider  a  more  complex  extension  of  the  maze  world,  invented  by  Singh  (1991),  which 
consists  of  a  maze  and  extra  state  information  dependent  on  where  you  have  visited  so  far  in  the 
maze. We use the example in Figure 16. There are 263 cells, but there are also four binary flags
appended to the state, producing a total of 263 × 16 = 4208 states. The flags, named A, B, C
and X, are set whenever the cell containing the corresponding letter is passed through. All flags
are cleared when the start state (in the bottom left hand corner) is entered. A reward is given
when the goal state (top right) is entered, only if flags A, B and C are set. Flag X provides further
interest. If X is clear, the reward is 100 units. If X is set, the reward is only 50 units. This task
does  not  specify  which  order  A,  B  and  C  are  to  be  visited.  The  controller  must  find  the  optimal 
path. 
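
The augmented state can be pictured as a cell number combined with a four-bit flag word. The sketch below is illustrative only: the cell numbering, the bit order of the flags, and the function names are assumptions made for the example, and a reward of zero is assumed whenever the stated reward condition is not met.

```python
N_CELLS = 263                  # maze cells (from the text)
FLAGS = ("A", "B", "C", "X")   # four binary flags (from the text); bit order is assumed

def encode(cell, flags):
    """Combine a cell number in [0, 263) and a set of flag letters into one state index."""
    bits = sum(1 << i for i, name in enumerate(FLAGS) if name in flags)
    return cell * (1 << len(FLAGS)) + bits

def decode(state):
    """Inverse of encode: recover the cell number and the set of flags that are set."""
    cell, bits = divmod(state, 1 << len(FLAGS))
    flags = {name for i, name in enumerate(FLAGS) if bits & (1 << i)}
    return cell, flags

def goal_reward(flags):
    """Reward on entering the goal: needs A, B and C; X halves it."""
    if not {"A", "B", "C"} <= set(flags):
        return 0
    return 50 if "X" in flags else 100
```

To the learner, `encode(7, {"A"})` and `encode(7, set())` are simply two unrelated state indices, even though both refer to the same maze cell.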

Prioritized sweeping was tried with both the deterministic and stochastic maze dynamics
($\gamma = 0.99$, $r^{opt} = 200$, $\beta = 10$, $\epsilon = 10^{-3}$). In the deterministic case Tbored = 1. In the stochastic case
Tbored  =  5.  In  both  cases  it  found  the  globally  optimal  path  through  the  three  good  flags  to  the 
goal,  avoiding  flag  X.  The  deterministic  case  took  19,000  observations  and  twenty  minutes  of  real 
time.  The  stochastic  case  required  120,000  observations  and  two  hours  of  real  time. 

In  these  experiments,  no  information  regarding  the  special  structure  of  the  problem  was  available 
to the learner. For example, knowledge of the cell at coordinates (7, 1) with flag A set had no bearing
on knowledge of the cell at coordinates (7, 1) with A clear. If we told the learner that cell transitions
are independent of flag settings then the convergence rate would be increased considerably. A far
more  interesting  possibility  is  the  automatic  discovery  of  such  structure  by  inductive  inference  on 
Figure 16: A maze with extra state in the form of four binary flags.


the structure of the learned state transition matrix. See Singh (1991) for current interesting work
in  that  direction. 

The  third  experiment  is  the  familiar  pole-balancing  problem  of  Michie  and  Chambers  (1968). 
There  is  no  place  here  to  discuss  the  enormous  number  of  techniques  which  have  been  applied  to 
this problem along with an equally enormous variation in details of the task formulation. The
state-space of the cart is quantized at three equal levels for cart position, cart velocity, and pole angular
speed. It is quantized at six equal levels for pole angle. The simulation used four real-valued state
variables, yet the learner was only allowed to base its control decisions on the current quantized
state. There are two actions: thrust left 10 N and thrust right 10 N. The problem is interesting
because it involves hidden state: the controller believes the system is Markov when in fact it is not.
This is because there are many possible values for the real-valued state variables in each discretized
box, and successor boxes are partially determined by these real values, which are not given to
the controller. The task is defined by a reward of 100 units for every state except one absorbing
state corresponding to a crash, which receives zero reward. Crashes occur if the pole angle or cart
position exceeds its limits. A discount factor of $\gamma = 0.999$ is used and trials start in random
survivable configurations. Other parameters are $r^{opt} = 200$, $\beta = 100$, $T_{bored} = 1$, $\epsilon = 10^{-3}$.
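
A minimal sketch of the box quantization follows. The level counts come from the text; the numeric ranges in `RANGES` and the name `box_index` are assumptions chosen purely for illustration, since the limits used in the experiment are not stated here.

```python
# Levels per variable, from the text: three each for cart position, cart velocity
# and pole angular speed, six for pole angle.
LEVELS = (3, 3, 3, 6)

# Hypothetical (low, high) ranges for each variable; not taken from the paper.
RANGES = ((-2.4, 2.4), (-3.0, 3.0), (-3.5, 3.5), (-0.21, 0.21))

def box_index(x, x_dot, theta_dot, theta):
    """Map the four real-valued state variables to one of 3*3*3*6 = 162 boxes."""
    index = 0
    for value, n, (lo, hi) in zip((x, x_dot, theta_dot, theta), LEVELS, RANGES):
        level = int((value - lo) / (hi - lo) * n)   # equal-width levels
        level = min(n - 1, max(0, level))           # clamp out-of-range values
        index = index * n + level
    return index
```

Many distinct real-valued states share a box: for instance `box_index(0.1, 0, 0, 0)` and `box_index(0.2, 0, 0, 0)` are equal, yet the two underlying states may lead to different successor boxes. This is the hidden state referred to above.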

If the simulation contains no noise, or a very small amount (0.1% added to the simulated thrust),
prioritized sweeping very quickly (usually in under 1000 observations and 15 crashes) develops a
policy which provides stability for approximately 100,000 cycles. With a small amount of noise
(1%), stable runs of approximately 20,000 time steps are discovered after, on average, 30 crashes.

6  Heuristics  to  Guide  Search 

In all experiments to date, the optimistic estimate of the best available one-step reward, $r^{opt}$, has
been  set  to  an  overestimate  of  the  best  reward  which  is  actually  available.  However,  if  the  human 
programmer  knows  in  advance  what  is  the  best  possible  reward-to-go  from  any  given  state,  then 
the  resultant,  more  realistic,  optimism  does  not  need  to  experience  all  state-action  pairs. 

For example, consider the maze world. If the robot is told the location of the goal state (in all
previous experiments it was not given this information), but is not told which states are blocked,
then it can nevertheless compute what would be the best possible reward-to-go from a state. It
could not be greater than the reward obtained from the shortest possible path to the goal. The
length of the path, $l$, can be computed easily with the Manhattan distance metric and then the
best possible reward-to-go is

$$0\gamma^{1} + 0\gamma^{2} + \cdots + 0\gamma^{l-1} + r^{opt}\gamma^{l} + r^{opt}\gamma^{l+1} + \cdots = \frac{r^{opt}\,\gamma^{l}}{1-\gamma} \qquad (18)$$

When  this  optimistic  heuristic  is  used,  initial  exploration  is  biased  towards  the  goal,  and  once  a 
path  is  discovered  then  many  of  the  unexplored  areas  may  be  ignored.  Ignoring  occurs  when  even 
the  most  optimistic  reward-to-go  of  a  state  is  no  greater  than  that  of  the  already  obtained  path. 
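
As a rough illustration of this heuristic, the closed form of Equation 18 can be evaluated directly from grid coordinates. The function names and the (row, column) representation of states are assumptions for the example; the default parameter values are those quoted elsewhere in the paper.

```python
def manhattan(state, goal):
    """Length l of the shortest conceivable path on a grid with no known obstacles."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def optimistic_reward_to_go(state, goal, r_opt=200.0, gamma=0.99):
    """Closed form of Equation 18: r_opt * gamma**l / (1 - gamma)."""
    l = manhattan(state, goal)
    return r_opt * gamma ** l / (1.0 - gamma)

def can_ignore(state, goal, best_found, r_opt=200.0, gamma=0.99):
    """A state may be ignored when even its most optimistic reward-to-go
    is no greater than that of the path already obtained."""
    return optimistic_reward_to_go(state, goal, r_opt, gamma) <= best_found
```

States far from the goal receive a smaller optimistic value, so exploration is drawn towards the goal first, and distant unexplored states drop out of consideration once a sufficiently good path is known.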

For example, Figure 17 shows the areas explored using a Manhattan heuristic when finding the
optimal  path  from  the  start  state  at  the  bottom  leftmost  cell  to  the  goal  state  at  the  center  of  the 
maze.  The  maze  has  8525  states  of  which  only  1722  needed  to  be  explored. 

For  some  tasks  we  may  be  satisfied  to  cease  exploration  when  we  have  obtained  a  solution 
known  to  be,  say,  within  50%  of  the  optimal  solution.  This  can  be  achieved  by  using  a  heuristic 
which  lies:  it  tells  us  that  the  best  possible  reward-to-go  is  that  of  a  path  which  is  twice  the  length 
of  the  true  shortest  possible  path. 
Figure 17: Dotted states are all those visited when the Manhattan heuristic was used to derive the optimistic reward-to-go ($\gamma = 0.99$, $\beta = 10$, $T_{bored}$ = [illegible], $\epsilon = 10^{-3}$).


7  Discussion 


Generalization  of  the  state  transition  model 

This  paper  has  been  concerned  with  discrete  state  systems  in  which  no  prior  assumptions  are  made 
about  the  structure  of  the  state-space.  Despite  the  weakness  of  the  assumptions,  we  can  successfully 
learn  large  stochastic  tasks.  However,  very  many  problems  do  have  extra  known  structure  in  the 
state-space, and it is important to consider how this knowledge can be used. By far the most
common  knowledge  is  smoothness — given  two  states  which  are  in  some  way  similar,  in  general 
their  transition  probabilities  will  be  similar. 

TD  can  also  be  applied  to  highly  smooth  problems  using  a  parametric  function  approximator 
such  as  a  neural  network.  This  technique  has  recently  been  used  successfully  on  a  large  complex 
problem,  Backgammon,  by  Tesauro  (1991).  The  discrete  version  of  prioritized  sweeping  given  in 
this  paper  could  not  be  applied  directly  to  Backgammon  because  the  game  has  10^^  states,  which 
is unmanageably large by a factor of at least 10^°. However, a method which quantized the space
of  board  positions,  or  used  a  more  sophisticated  smoothing  mechanism,  might  conceivably  be  able 
to  compute  a  near-optimal  strategy. 

We are currently developing memory-based algorithms which take advantage of local smoothness
assumptions. In these investigations, state transition models are learned by memory-based function
approximators (Moore and Atkeson 1992). Prioritized sweeping takes place over non-uniform
tessellations of state-space, partitioned by variable resolution kd-trees (Moore 1991). We are also
investigating the role of locally linear control rules and reward functions in such partitionings, in
which instead of using Bellman's Equation (16) directly, we use local linear quadratic regulators
(LQR) (see, for example, Sage and White (1977)). It is worth remembering that, if the system is
sufficiently linear, LQR is an extremely powerful technique. In a pole balancer experiment in which
we used locally weighted regression to identify a local linear model, LQR was able to create a stable
controller based on only 31 state transitions!

Other current investigations which attempt to perform generalization in conjunction with
reinforcement learning are Mahadevan and Connell (1990), which investigates clustering parts of the
policy; Chapman and Kaelbling (1990), which investigates automatic detection of locally relevant
state variables; and Singh (1991), which considers how to automatically discover the structure in
tasks such as the multiple-flags example of Figure 16.

7.1  Related  work 

The  Dyna-Q-queue  algorithm  of  Peng  and  Williams 

Peng  and  Williams  (1992)  have  concurrently  been  developing  a  closely  related  algorithm  which  they 
call  Dyna-Q-queue.  This  conceptually  similar  idea  was  discovered  independently.  Where  prioritized 
sweeping  provides  efficient  data  processing  for  methods  which  learn  the  state  transition  model, 
Dyna-Q-queue  performs  the  same  role  for  Q-learning  (Watkins  1989),  an  algorithm  which  avoids 
building  an  explicit  state-transition  model.  Dyna-Q-queue  is  also  more  careful  about  what  it  allows 
onto the priority queue: it only allows predecessors which have a predicted change (“interestingness”
value) greater than a significant threshold $\delta$, whereas prioritized sweeping allows everything above
a  minuscule  change  (c  =  10~®  times  the  maximum  reward)  onto  the  queue.  The  initial  experiments 
in  Peng  and  Williams  (1992)  consist  of  sparse,  deterministic  maze  worlds  of  several  hundred  cells. 
Performance, measured by total number of Bellman's equation processing steps before convergence,
is  greatly  improved  over  conventional  Dyna-Q  (Sutton  1990). 
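
To make the role of the priority queue and its admission threshold concrete, here is a simplified sketch of queue-driven backups over a learned model. It is not the exact algorithm of this paper nor Dyna-Q-queue: the data layout, the function name `prioritized_sweep`, the use of an absolute threshold `epsilon`, and the priority rule (largest transition probability into the changed state times the size of the change) are all assumptions made for the illustration.

```python
import heapq

def prioritized_sweep(J, model, predecessors, start,
                      gamma=0.99, epsilon=1e-3, max_backups=100):
    """Perform up to max_backups Bellman backups in priority order.

    J            : dict state -> current value estimate (missing means 0.0)
    model        : dict state -> {action: [(prob, reward, next_state), ...]}
    predecessors : dict state -> set of states with a transition into it
    start        : state to back up first (e.g. the most recent real experience)
    """
    queue = [(-float("inf"), start)]      # heapq is a min-heap, so keys are negated
    queued = {start: float("inf")}        # best priority currently recorded per state
    backups = 0
    while queue and backups < max_backups:
        neg_priority, s = heapq.heappop(queue)
        if queued.get(s) != -neg_priority:
            continue                      # stale entry superseded by a higher priority
        del queued[s]
        old = J.get(s, 0.0)
        J[s] = max(
            sum(p * (r + gamma * J.get(s2, 0.0)) for p, r, s2 in outcomes)
            for outcomes in model[s].values()
        )
        backups += 1
        change = abs(J[s] - old)
        for pred in predecessors.get(s, ()):
            q = max((p for outcomes in model[pred].values()
                     for p, _, s2 in outcomes if s2 == s), default=0.0)
            priority = q * change
            # Only predecessors whose predicted change exceeds the threshold are queued.
            if priority > epsilon and priority > queued.get(pred, 0.0):
                queued[pred] = priority
                heapq.heappush(queue, (-priority, pred))
    return backups
```

Raising `epsilon` towards a "significant" value mimics the more selective admission policy attributed to Dyna-Q-queue above, while a minuscule value lets nearly every affected predecessor onto the queue.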

Other  related  work 

Sutton (1990) identifies reinforcement learning with asynchronous dynamic programming and
introduces the same computational regime as that used for prioritized sweeping. The notion of using
an optimistic heuristic to guide search goes back to the A* tree search algorithm (Nilsson 1971),
which  also  motivated  another  aspect  of  prioritized  sweeping:  it  too  schedules  nodes  to  be  expanded 
according  to  an  (albeit  different)  priority  measure.  More  recently  Korf  (1990)  gives  a  combination 
of  A*  and  Dynamic  Programming  in  the  LRTA*  algorithm.  LRTA*  is,  however,  very  different  from 
prioritized  sweeping;  it  concentrates  all  search  effort  in  a  finite-horizon  set  of  states  beyond  the 
current  actual  system  state.  Finally,  Lin  (1991)  has  investigated  a  simple  technique  which  replays, 
backwards,  the  memorized  sequence  of  experiences  which  the  controller  has  recently  had.  Under 
some  circumstances  this  may  produce  some  of  the  beneficial  effects  of  prioritized  sweeping. 
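
The effect of backward replay can be seen in a small sketch. The one-step Q-learning update used here is our choice for illustration and is not claimed to match the details of Lin (1991); the function name and the tuple layout of an experience are likewise assumptions.

```python
def replay_backwards(Q, episode, actions, alpha=0.5, gamma=0.99):
    """Replay a remembered episode from its last experience to its first.

    Q       : dict (state, action) -> value estimate (missing means 0.0)
    episode : list of (state, action, reward, next_state) in the order experienced
    actions : iterable of all actions, used for the max over the next state
    """
    actions = list(actions)
    for s, a, r, s2 in reversed(episode):
        best_next = max((Q.get((s2, b), 0.0) for b in actions), default=0.0)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```

Replayed backwards, a reward at the end of the episode propagates to every earlier state in a single pass; replayed forwards, the same pass would update only the final step.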

8  Conclusion 

Our investigation shows that prioritized sweeping can solve large state-space real time problems
with  which  other  methods  have  difficulty.  Other  benefits  of  the  memory-based  approach,  described 
in  Moore  and  Atkeson  (1992),  allow  us  to  control  forgetting  in  potentially  changeable  environments 
and  to  automatically  scale  state  variables.  Prioritized  sweeping  is  heavily  based  on  learning  a  world 
model  and  we  conclude  with  a  few  words  on  this  topic. 

If  a  model  of  the  world  is  not  known  to  the  human  programmer  in  advance  then  an  adaptive 
system  is  required,  and  there  are  two  alternatives: 

    Learn a model and from this develop a control rule.

    Learn a control rule without building a model.


Dyna and prioritized sweeping fall into the first category. Temporal differences and Q-learning
fall  into  the  second.  Two  motivations  for  not  learning  a  model  are  (i)  the  interesting  fact  that 
the  methods  do,  nevertheless,  learn,  and  (ii)  the  possibility  that  this  more  accurately  simulates 
some  kinds  of  biological  learning  (Sutton  and  Barto  1990).  However,  a  third  advantage  which  is 
sometimes  touted — that  there  are  computational  benefits  in  not  learning  a  model — is,  in  our  view, 
dubious.  A  common  argument  is  that  with  the  real  world  available  to  be  sensed  directly,  why  should 
we  bother  with  less  reliable,  learned  internal  representations?  The  counterargument  is  that  even 
systems acting in real time can, for every one real experience, sample millions of mental experiences
from  which  to  make  decisions  and  improve  control  rules. 

Consider  a  more  colorful  example.  Suppose  the  anti-model  argument  was  applied  by  a  new 
arrival  at  a  university  campus:  “I  don’t  need  a  map  of  the  university — the  university  is  its  own 
map.”  If  the  new  arrival  truly  mistrusts  the  university  cartographers  then  there  might  be  an 
argument  for  one  full  exploration  of  the  campus  in  order  to  create  their  own  map.  However, 
once this map has been produced, the amount of time saved overall by pausing to consult the
map before traveling to each new location (rather than exhaustive or random search in the real
world) is undeniably enormous.

It is justified to complain about the indiscriminate use of combinatorial search or matrix
inversion prior to each supposedly real time decision. However, models need not be used in such an
extravagant fashion. The prioritized sweeping algorithm is just one example of a class of algorithms
which  can  easily  operate  in  real  time  and  also  derive  great  power  from  a  model. 
Appendix A. The random generation of a stochastic problem

Here is an algorithm to generate stochastic systems such as Figure 4 in Section 3. The parameters
are: $S_{nt}$, the number of non-terminal states; $S_t$, the number of terminal states; and $\mu_{succs}$, the
mean number of successors.

All states have a position within the unit square. The terminal states are generated on an
equispaced circle, diameter 0.9, alternating between black and white. Non-terminal states are each
positioned in a uniformly random location within the square. Then the successors of each
non-terminal are selected. The number of successors is chosen randomly as $1 + X$ where $X$ is a random
variable drawn from the exponential distribution with mean $\mu_{succs} - 1$.

The  choice  of  successors  is  affected  by  locality  within  the  unit  square.  This  provides  a  more 
interesting  system  than  allowing  successors  to  be  entirely  random.  It  was  empirically  noted  that 
entirely  random  successors  cause  the  long-term  absorption  probabilities  to  be  very  similar  across 
most  of  the  set  of  states.  Locality  leads  to  a  more  varied  distribution. 

The  successors  are  chosen  according  to  a  simple  algorithm  in  which  they  are  drawn  from  within 
a  slowly  growing  circle  centered  on  the  parent  state. 

If the parent state is $i$, and there are $N_i$ successors, then the $j$th transition probability is
computed by $X_j / \sum_{k=1}^{N_i} X_k$ where $\{X_1, \ldots, X_{N_i}\}$ are independent random variables, uniformly
distributed in the unit interval.
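
The two random draws described in this appendix can be sketched as below. The function names are hypothetical, and truncating $1 + X$ to an integer is an assumption, since the text does not say how the continuous draw is converted to a count. The locality-based choice of which states become successors (the slowly growing circle) is not reproduced here.

```python
import random

def num_successors(mu_succs, rng):
    """Draw 1 + X with X exponential of mean mu_succs - 1.
    Truncation to an integer is an assumption of this sketch."""
    if mu_succs <= 1.0:
        return 1
    x = rng.expovariate(1.0 / (mu_succs - 1.0))   # expovariate takes the rate 1/mean
    return 1 + int(x)

def transition_probabilities(n, rng):
    """Normalize n independent uniform draws: the j-th probability is X_j / sum_k X_k."""
    xs = [rng.random() for _ in range(n)]
    total = sum(xs)
    return [x / total for x in xs]
```

A typical use would be `rng = random.Random(seed)`, then `n = num_successors(mu, rng)` followed by `transition_probabilities(n, rng)` for each non-terminal state.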

Once  the  system  has  been  generated,  a  check  is  performed  that  the  system  is  absorbing — all 
non-terminals can eventually reach at least one terminal state. If not, entirely new systems are
randomly  generated  until  an  absorbing  Markov  system  is  obtained. 
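
One way to perform the absorption check is a backward breadth-first search from the terminal states. This is a sketch under assumed data structures (a dictionary from each non-terminal state to its successors); the original implementation is not described.

```python
from collections import deque

def is_absorbing(successors, terminals):
    """Return True if every non-terminal state can eventually reach a terminal state.

    successors : dict mapping each non-terminal state to an iterable of successor states
    terminals  : iterable of terminal states
    """
    # Build the reverse graph so we can search backwards from the terminals.
    reverse = {}
    for s, succs in successors.items():
        for s2 in succs:
            reverse.setdefault(s2, set()).add(s)
    reached = set(terminals)
    frontier = deque(reached)
    while frontier:
        s = frontier.popleft()
        for pred in reverse.get(s, ()):
            if pred not in reached:
                reached.add(pred)
                frontier.append(pred)
    return all(s in reached for s in successors)
```

Generation would then be repeated until `is_absorbing` returns True, matching the regenerate-until-absorbing rule stated above.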